Bug 68181
| Summary: | mdadm/raid10/md_mod segfaults during the mirror resync | | |
|---|---|---|---|
| Product: | IO/Storage | Reporter: | Damian Nowak (spam) |
| Component: | MD | Assignee: | Neil Brown (neilb) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | blocking | CC: | mateusz, neilb, spam |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| Kernel Version: | 3.12.6-1-ARCH | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
Attachments:
- dmesg with mdadm resync segfault
- kernel panic stacktrace screenshot (incomplete)
- full dmesg since server start (second time)
- another kernel panic screenshot with a different stacktrace part visible
- kernel panic from dmesg saved by systemd-journald
- kernel panic screenshot
Created attachment 121011 [details]
kernel panic stacktrace screenshot (incomplete)
Created attachment 121021 [details]
full dmesg since server start (second time)
```
# uname -a
Linux sml.cumulushost.eu 3.12.6-1-ARCH #1 SMP PREEMPT Fri Dec 20 19:39:00 CET 2013 x86_64 GNU/Linux

# lsmod
Module                  Size  Used by
ebt_arp                 1905  28
ebt_ip                  1646  14
vhost_net              10194  14
vhost                  18526  1 vhost_net
macvtap                10801  1 vhost_net
macvlan                 9645  1 macvtap
xt_conntrack            3273  1
ipt_REJECT              2449  2
tun                    19663  32 vhost_net
ip6table_filter         1436  0
ip6_tables             17056  1 ip6table_filter
ebtable_nat             1796  1
ebtables               24210  1 ebtable_nat
iptable_filter          1496  1
ipt_MASQUERADE          2162  4
bridge                 97094  0
iptable_nat             3366  1
stp                     1661  1 bridge
nf_conntrack_ipv4       9206  2
nf_defrag_ipv4          1411  1 nf_conntrack_ipv4
nf_nat_ipv4             3576  1 iptable_nat
llc                     3729  2 stp,bridge
nf_nat                 12335  3 ipt_MASQUERADE,nf_nat_ipv4,iptable_nat
nf_conntrack           73690  6 ipt_MASQUERADE,nf_nat,nf_nat_ipv4,xt_conntrack,iptable_nat,nf_conntrack_ipv4
xt_CHECKSUM             1239  1
xt_tcpudp               3151  5
iptable_mangle          1592  1
ip_tables              17282  3 iptable_filter,iptable_mangle,iptable_nat
x_tables               17359  13 ip6table_filter,ebt_arp,xt_CHECKSUM,ip_tables,xt_tcpudp,ipt_MASQUERADE,ebt_ip,xt_conntrack,iptable_filter,ebtables,ipt_REJECT,iptable_mangle,ip6_tables
kvm_amd                52575  98
evdev                  11085  2
kvm                   388773  1 kvm_amd
microcode              15216  0
mgag200                35408  1
igb                   161570  0
ttm                    63507  1 mgag200
drm_kms_helper         36286  1 mgag200
ptp                     8276  1 igb
drm                   238206  3 ttm,drm_kms_helper,mgag200
psmouse                85460  0
syscopyarea             3486  1 mgag200
sysfillrect             3818  1 mgag200
sysimgblt               2143  1 mgag200
pps_core                8961  1 ptp
dca                     5152  1 igb
i2c_algo_bit            5399  2 igb,mgag200
pcspkr                  2035  0
serio_raw               5049  0
ipmi_si                42664  0
ipmi_msghandler        32144  1 ipmi_si
button                  4677  0
sp5100_tco              5824  0
amd64_edac_mod         21465  0
edac_core              44969  2 amd64_edac_mod
acpi_cpufreq           10779  1
k10temp                 3058  0
i2c_piix4              10592  0
shpchp                 25465  0
edac_mce_amd           13094  1 amd64_edac_mod
i2c_core               24164  6 drm,igb,i2c_piix4,drm_kms_helper,mgag200,i2c_algo_bit
processor              24935  1 acpi_cpufreq
sd_mod                 30821  11
ata_generic             3378  0
pata_acpi               3395  0
ahci                   23056  8
pata_atiixp             4819  0
libahci                21674  1 ahci
ohci_pci                4040  0
ehci_pci                4000  0
ohci_hcd               23937  1 ohci_pci
ehci_hcd               59220  1 ehci_pci
libata                170856  5 ahci,pata_acpi,libahci,ata_generic,pata_atiixp
usbcore               180136  4 ohci_hcd,ohci_pci,ehci_hcd,ehci_pci
scsi_mod              130701  2 libata,sd_mod
usb_common              1656  1 usbcore
raid10                 40470  1
raid1                  27812  1
md_mod                106028  4 raid1,raid10
dm_mod                 82501  6
ext4                  474251  2
crc16                   1367  1 ext4
mbcache                 6082  1 ext4
jbd2                   83504  1 ext4
```

Created attachment 121031 [details]
another kernel panic screenshot with a different stacktrace part visible
Thanks for the report. The bug was introduced in Linux 3.1 by commit e875ecea266a543e643b19e44cf472f1412708f9. It should be fixed by moving 4 lines of code as shown below: those lines released r10_bio back to the pool (and switched to rb2) before the bad-block-recording code below them had finished using it.

The fact that you hit this bug means that you have a bad block on /dev/sdd2. When md tries to read that block while rebuilding /dev/sdc2, it fails, and that leads to the code with the bug. So with the bug fixed, the recovery will simply fail when it gets to that point in the process. However, if that were the case I would expect to see evidence of the read error in the kernel logs, yet we don't. So I'm a little confused. Are you able to build a kernel with the following change applied and try that?

```diff
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index c504e8389e69..bb4334205558 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3218,10 +3218,6 @@ static sector_t sync_request(struct mddev *mddev, sector_t sector_nr,
 		if (j == conf->copies) {
 			/* Cannot recover, so abort the recovery or
 			 * record a bad block */
-			put_buf(r10_bio);
-			if (rb2)
-				atomic_dec(&rb2->remaining);
-			r10_bio = rb2;
 			if (any_working) {
 				/* problem is that there are bad blocks
 				 * on other device(s)
@@ -3253,6 +3249,10 @@
 				mirror->recovery_disabled = mddev->recovery_disabled;
 			}
+			put_buf(r10_bio);
+			if (rb2)
+				atomic_dec(&rb2->remaining);
+			r10_bio = rb2;
 			break;
 		}
 	}
```

Hi Neil, thanks for digging into my issue. I will apply the patch and let you know. After the error there are a few other entries in dmesg, but all of them are irrelevant - some polkit, sshd and kvm (qemu) logs. So yes, there are no read errors in the log.

Hi Neil, not only did this patch resolve the issue - it let the sync go further. The mirror resync is past 35.4% at this moment and still proceeding. I don't think the disk I sync from is broken. I did `dd if=/dev/disk of=/dev/null` for each disk, and there weren't any read errors. If I am not wrong, an mdadm resync reads the disk much like `dd` does (let me know if I'm wrong), so the conclusion should be that there aren't any bad blocks. If I'm right, there must have been some NULL-pointer dereference in the code you patched. Or maybe the patch you provided ignores the bad block (if it exists) instead of failing?

Hi Neil, the resync reached 100%. No errors in dmesg. There are 25 bad sectors marked by mdadm in /sys/block/md126/md/dev-sdc2/bad_blocks and /sys/block/md126/md/dev-sdd2/bad_blocks. However, I don't know whether they were already there before I first faced the problem described in this issue, because I wasn't aware of the bad_blocks file in /sys. Anyway, mdadm seems to work OK. However, when I try to read one specific file on the disk, I experience a kernel panic. This isn't any OS-critical file, just some file with data (a qemu sparse image). Reading a file that lies on some bad sectors shouldn't lead to that, I guess. ;) ext4, dm_mod, md_mod and raid10 are in the stacktrace. Are you able to identify any problems in the code so the kernel panic could be avoided? Thanks for your help.

```
Call Trace:
 [<ffffffff814ecb53>] dump_stack+0x54/0x8d
 [<ffffffff81061b4d>] warn_slowpath_common+0x7d/0xa0
 [<ffffffff81061bbc>] warn_slowpath_fmt+0x4c/0x50
 [<ffffffff811d9574>] bio_advance+0xa4/0xb0
 [<ffffffffa00e3480>] md_trim_bio+0x30/0xd0 [md_mod]
 [<ffffffffa0117c24>] make_request+0x354/0xe90 [raid10]
 [<ffffffffa00e898c>] md_make_request+0xec/0x290 [md_mod]
 [<ffffffffa00c51ab>] ? dm_request+0x14b/0x200 [dm_mod]
 [<ffffffff8126bfe2>] generic_make_request+0xc2/0x110
 [<ffffffff8126c0a1>] submit_bio+0x71/0x150
 [<ffffffff8119854e>] ? __mem_cgroup_commit_charge+0xce/0x3f0
 [<ffffffff811e19df>] do_mpage_readpage+0x48f/0x6c0
 [<ffffffff81152cc3>] ? __inc_zone_page_state+0x33/0x40
 [<ffffffff81134ee6>] ? add_to_page_cache_locked+0xe6/0x1d0
 [<ffffffff811e1cfc>] mpage_readpages+0xec/0x140
 [<ffffffffa00344a0>] ? _ext4_get_block+0x1b0/0x1b0 [ext4]
 [<ffffffffa00344a0>] ? _ext4_get_block+0x1b0/0x1b0 [ext4]
 [<ffffffff8117d889>] ? alloc_pages_current+0xa9/0x160
 [<ffffffffa003294a>] ext4_readpages+0x3a/0x50 [ext4]
 [<ffffffff81140e76>] __do_page_cache_readahead+0x1b6/0x260
 [<ffffffff8114135a>] ondemand_readahead+0x14a/0x2c0
 [<ffffffff811415a0>] page_cache_async_readahead+0x80/0xa0
 [<ffffffff81135ef8>] generic_file_aio_read+0x468/0x710
 [<ffffffff811a2eba>] do_sync_read+0x5a/0x90
 [<ffffffff811a347a>] vfs_read+0x9a/0x170
 [<ffffffff811a3fc9>] SyS_read+0x49/0xa0
 [<ffffffff814fb6ed>] system_call_fastpath+0x1a/0x1f
```

```
Call Trace:
 [<ffffffffa04acb88>] sd_prep_fn+0x318/0xb80 [sd_mod]
 [<ffffffff8126d01b>] blk_peek_request+0xfb/0x2a0
 [<ffffffffa012ca90>] scsi_request_fn+0x50/0x4a0 [scsi_mod]
 [<ffffffff8106f08b>] ? lock_timer_base.isra.35+0x2b/0x50
 [<ffffffff812696f3>] __blk_run_queue+0x33/0x40
 [<ffffffff8128c95d>] cfq_insert_request+0x34d/0x560
 [<ffffffff8129002f>] ? idr_remove.part.6+0xef/0x250
 [<ffffffff81266acc>] __elv_add_request+0x1cc/0x310
 [<ffffffff8126d4c7>] blk_flush_plug_list+0x1b7/0x210
 [<ffffffff8126d914>] blk_finish_plug+0x14/0x40
 [<ffffffffa01198e8>] raid10d+0x9f8/0x1600 [raid10]
 [<ffffffff814f2199>] ? schedule+0x29/0x70
 [<ffffffff814f36d6>] ? _raw_spin_lock_irqsave+0x6/0x50
 [<ffffffffa00e75fd>] md_thread+0x12d/0x170 [md_mod]
 [<ffffffff81084d40>] ? wake_up_atomic_t+0x30/0x30
 [<ffffffffa00e74d0>] ? md_register_thread+0xe0/0xe0 [md_mod]
 [<ffffffff81083f20>] kthread+0xc0/0xd0
 [<ffffffff81083e60>] ? kthread_create_on_node+0x120/0x120
 [<ffffffff814fb63c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81083e60>] ? kthread_create_on_node+0x120/0x120
```

Created attachment 121141 [details]
kernel panic from dmesg saved by systemd-journald
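Aside: offsets like `make_request+0x354 [raid10]` in the traces above can usually be mapped back to a source line. A minimal sketch, assuming the kernel and its modules were built locally with CONFIG_DEBUG_INFO enabled and the build tree is still available:

```sh
# gdb resolves a symbol+offset inside the module object to file:line
# when the module was built with debug info.
gdb --batch -ex 'list *(make_request+0x354)' drivers/md/raid10.ko
```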
Yeah ... that certainly shouldn't happen.

It is quite possible that the bad blocks were found during the recovery. I didn't think you would have a bad-block log configured - you need a very recent version of mdadm for that. That is why I thought the bad block would abort the recovery. As you do have the log available, the recovery completed.

After a recovery you would expect the bad-block lists for both devices to be identical. Could you confirm that /sys/block/md126/md/dev-sdc2/bad_blocks and /sys/block/md126/md/dev-sdd2/bad_blocks contain exactly the same numbers?

In the case where these two lists are identical, the code which caused the crash shouldn't be reached at all. But then it shouldn't crash if they are different either. So if you could confirm whether they are the same or different before a crash, that would help me know where to focus my search. Thanks.

They are the same.

Hi Neil, the bad_blocks files were identical. I managed to somehow clear the list for sdc2; then I tried to read the broken file, but the kernel still panics. After a resync the bad_blocks lists are identical again. I'll attach a screenshot from the latest kernel panic. This will be the last one, because we must bring this machine back to production.

Created attachment 121751 [details]
kernel panic screenshot
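For the record, the comparison Neil asked for is a one-liner against the sysfs paths quoted above; a minimal sketch (array and device names taken from this report):

```sh
# Each bad_blocks line is "<first-bad-sector> <length>"; after a clean
# recovery the two member lists should match exactly.
diff /sys/block/md126/md/dev-sdc2/bad_blocks \
     /sys/block/md126/md/dev-sdd2/bad_blocks && echo identical
```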
OK, I discovered why I couldn't find the problem before. I was looking in the wrong place - at raid1.c instead of raid10.c. Doh.

The following patch should fix it. Just two very small changes: remove an 's' and add an '_irq'. The first fixes a typo - sectors_handled was computed from r10_bio->sectors (a length) instead of r10_bio->sector (an offset). The second pairs the spin_lock_irq() with a matching spin_unlock_irq(); the plain spin_unlock() released the lock but left interrupts disabled.

It looks like you are the first person to enable the bad-block log on a raid10 array and get bad blocks logged. Thanks for the pioneering spirit!!

I realise that you want to actually use your machine and not keep testing, but if you are able to confirm at some stage that this makes a difference, I would really appreciate it.

```diff
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 73dc8a377522..cc9b777de9d8 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1319,7 +1319,7 @@ read_again:
 		/* Could not read all from this device, so we will
 		 * need another r10_bio.
 		 */
-		sectors_handled = (r10_bio->sectors + max_sectors
+		sectors_handled = (r10_bio->sector + max_sectors
 				   - bio->bi_sector);
 		r10_bio->sectors = max_sectors;
 		spin_lock_irq(&conf->device_lock);
@@ -1327,7 +1327,7 @@ read_again:
 			bio->bi_phys_segments = 2;
 		else
 			bio->bi_phys_segments++;
-		spin_unlock(&conf->device_lock);
+		spin_unlock_irq(&conf->device_lock);
 		/* Cannot call generic_make_request directly
 		 * as that will be queued in __generic_make_request
 		 * and subsequent mempool_alloc might block
```

Neil, thanks for having a look at it. Hope this patch prevents other people from having the same problem. ;-) We may give this patch a go, but Mateusz has already erased all the data and set up the mdadm array from scratch. In that case we won't know whether the bug is fixed or not, unfortunately.

Fair enough. I have the patch queued to go to Linus tomorrow, so we'll assume it is fixed. Thanks again for reporting and testing.

@Neil, just out of curiosity - can such things be mocked and unit tested somehow? I am totally unfamiliar with C and the available toolkit for this language. I once went to a talk about C unit tests and mocking, but I've never got closer than that.

mdadm has a small test suite in the source package, but it doesn't cover situations where bad blocks develop. Maybe it should.
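A minimal sketch of running that suite, assuming a checkout of the mdadm source tree (the tests assemble loop-device-backed arrays and must run as root; exact invocation may differ between mdadm versions):

```sh
# The top-level "test" script drives the cases under tests/.
git clone https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
cd mdadm
make
sudo ./test
```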
Created attachment 121001 [details]
dmesg with mdadm resync segfault

Hey, I am using the latest kernel, 3.12.6, built by Arch Linux. Recently one of the disks broke, and I replaced it with a new one. However, the mirror doesn't want to resync because of a problem with mdadm/raid10/md_mod. At some point, which is exactly the same every time, mdadm always segfaults with the same stacktrace, with md_mod and raid10 listed. After that, /proc/mdstat is no longer updated. The machine is usable for some time after this happens - sometimes 5 minutes, sometimes an hour. Eventually it kernel panics with ext4, dm_mod, md_mod and raid10 listed in the stacktrace. The most relevant parts of the logs and command outputs are inlined below. Full logs are attached.

mdadm segfault stacktrace:

```
[ 6176.001354] Call Trace:
[ 6176.018596]  [<ffffffffa00eb454>] md_do_sync+0x704/0xdf0 [md_mod]
[ 6176.036266]  [<ffffffff81085e00>] ? wake_up_atomic_t+0x30/0x30
[ 6176.054007]  [<ffffffffa00e761d>] md_thread+0x12d/0x170 [md_mod]
[ 6176.071752]  [<ffffffffa00e74f0>] ? md_register_thread+0xe0/0xe0 [md_mod]
[ 6176.089436]  [<ffffffff81084fe0>] kthread+0xc0/0xd0
[ 6176.106732]  [<ffffffff81084f20>] ? kthread_create_on_node+0x120/0x120
[ 6176.123869]  [<ffffffff814fcffc>] ret_from_fork+0x7c/0xb0
[ 6176.141006]  [<ffffffff81084f20>] ? kthread_create_on_node+0x120/0x120
(...)
[ 6176.194823] RIP  [<ffffffffa01146af>] sync_request+0x5ff/0x1990 [raid10]
[ 6176.329888] ---[ end trace 630b14e1fb4a7af4 ]---
(...)
[ 6176.619024] Call Trace:
[ 6176.637172]  [<ffffffff814ee4fb>] dump_stack+0x54/0x8d
(...)
[ 6176.877469]  [<ffffffff814f5dc8>] page_fault+0x28/0x30
[ 6176.893671]  [<ffffffffa0113bb8>] ? r10buf_pool_free+0xa8/0xc0 [raid10]
[ 6176.910082]  [<ffffffffa01146af>] ? sync_request+0x5ff/0x1990 [raid10]
[ 6176.926580]  [<ffffffffa0114680>] ? sync_request+0x5d0/0x1990 [raid10]
[ 6176.942758]  [<ffffffffa00eb454>] md_do_sync+0x704/0xdf0 [md_mod]
[ 6176.958823]  [<ffffffff81085e00>] ? wake_up_atomic_t+0x30/0x30
[ 6176.974890]  [<ffffffffa00e761d>] md_thread+0x12d/0x170 [md_mod]
[ 6176.990953]  [<ffffffffa00e74f0>] ? md_register_thread+0xe0/0xe0 [md_mod]
[ 6177.007038]  [<ffffffff81084fe0>] kthread+0xc0/0xd0
[ 6177.023021]  [<ffffffff81084f20>] ? kthread_create_on_node+0x120/0x120
[ 6177.039109]  [<ffffffff814fcffc>] ret_from_fork+0x7c/0xb0
[ 6177.055104]  [<ffffffff81084f20>] ? kthread_create_on_node+0x120/0x120
[ 6177.071253] ---[ end trace 630b14e1fb4a7af5 ]---
(...)
```

```
% cat /proc/mdstat
Personalities : [raid1] [raid10]
md125 : inactive sdd1[1](S)
      10477568 blocks super 1.2

md126 : active raid10 sdc2[5] sda2[0] sdd2[4] sdb2[1]
      1932289024 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]
      [=======>.............]  recovery = 35.4% (342860848/966144512) finish=5635.3min speed=1843K/sec

md127 : active raid1 sda1[0] sdb1[1]
      10477440 blocks super 1.2 [2/2] [UU]

unused devices: <none>
```

```
# mdadm -D /dev/md125
/dev/md125:
        Version : 1.2
  Creation Time : Mon Dec 10 14:17:59 2012
     Raid Level : raid10
     Array Size : 1932289024 (1842.77 GiB 1978.66 GB)
  Used Dev Size : 966144512 (921.39 GiB 989.33 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Sun Jan 5 17:42:43 2014
          State : active, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : near=2
     Chunk Size : 512K

 Rebuild Status : 35% complete

           Name : archiso:STORE
           UUID : 4eff1a86:bf500790:976dbff2:46d8aa69
         Events : 2292466

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync set-A   /dev/sda2
       1       8       18        1      active sync set-B   /dev/sdb2
       5       8       34        2      spare rebuilding    /dev/sdc2
       4       8       50        3      active sync set-B   /dev/sdd2
```

```
# for dev in sda sdb sdc sdd; do echo "=> $dev"; smartctl -a /dev/$dev | grep -i 'device model' -B 1; smartctl -a /dev/$dev | grep -i hours; smartctl -a /dev/$dev | grep -i _sector; done
=> sda
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST1000DM003-9YN162
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       9694
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       9587h+04m+37.325s
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
=> sdb
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST1000DM003-9YN162
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       9692
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       9584h+40m+44.948s
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
Error 4 occurred at disk power-on lifetime: 9677 hours (403 days + 5 hours)
Error 3 occurred at disk power-on lifetime: 9677 hours (403 days + 5 hours)
Error 2 occurred at disk power-on lifetime: 9677 hours (403 days + 5 hours)
Error 1 occurred at disk power-on lifetime: 9677 hours (403 days + 5 hours)
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   001   000    Old_age   Always       -       0
=> sdc
=== START OF INFORMATION SECTION ===
Device Model:     WDC WD1003FZEX-00MK2A0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       30
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
=> sdd
=== START OF INFORMATION SECTION ===
Device Model:     HITACHI HUA7210SASUN1.0T 0817GJABRF
  9 Power_On_Hours          0x0012   098   098   000    Old_age   Always       -       20244
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
```

The hardware is:
- 400 days old HP ProLiant DL165 G7
- 2x ST1000DM003-9YN162 - 400 days old
- 1x WDC WD1003FZEX-00MK2A0 - 1 day old
- 1x HITACHI HUA7210SASUN1.0T - 200 days old in this server, 600 days old just idling
- mdadm dies when resyncing from the Hitachi to the WDC

There haven't been any problems with mdadm/raid10/md_mod for a year. Of course it's possible that the disk the data is synced from is broken and doesn't announce it. However, I don't think md_mod should segfault because of that. The problem first appeared on 3.11.2. I updated to the latest version to see if it still persists on 3.12.6. It does. (Mateusz Kwiatkowski or I will answer all your questions.)
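The thread doesn't record the exact commands used to swap the disk, but for context the usual replacement sequence with mdadm's manage mode looks roughly like this (a sketch; the array and device names are reused from this report, not taken from the reporter's shell history):

```sh
# Hypothetical reconstruction of the replacement step.
mdadm --manage /dev/md126 --fail /dev/sdc2 --remove /dev/sdc2  # retire the dead member
# ...physically swap the drive and recreate the partition layout...
mdadm --manage /dev/md126 --add /dev/sdc2                      # start the rebuild
cat /proc/mdstat                                               # watch recovery progress
```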