Bug 108741

Summary: MD raid crash when copying large files
Product: IO/Storage
Reporter: Mattias Månsson (mattias.mansson)
Component: MD
Assignee: Neil Brown (neilb)
Status: RESOLVED CODE_FIX
Severity: normal
Priority: P1
CC: linux, mattias.mansson, mnl, neilb, shli
Hardware: x86-64
OS: Linux
Kernel Version: 4.2.5-1
Subsystem:
Regression: No
Bisected commit-id:

Description Mattias Månsson 2015-12-01 18:14:16 UTC
I recently switched from Ubuntu to Arch Linux and moved an existing MD raid to the new system. It's a RAID 5 of 4x2TB disks with ext3. A few days ago I copied a few large files (3x~15GB) in a screen job. The copy never finished and I could not access the computer over ssh, so I finally rebooted it.

Today I did the same thing. First I copied one of the files and it went fine. The next one just hung after a few GB. After that I could not kill the job or copy anything else to the raid. I checked dmesg and found the following:


[315991.664637] ------------[ cut here ]------------
[315991.664667] WARNING: CPU: 2 PID: 257 at drivers/md/raid5.c:4244 break_stripe_batch_list+0x1b9/0x260 [raid456]()
[315991.664672] Modules linked in: iptable_filter cfg80211 rfkill raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx iTCO_wdt evdev gpio_ich iTCO_vendor_support mac_hid raid6_pq coretemp i915 pcspkr md_mod psmouse serio_raw snd_hda_codec_realtek snd_hda_codec_generic i2c_i801 snd_hda_intel r8169 mii lpc_ich video snd_hda_codec drm_kms_helper snd_hda_core snd_hwdep snd_pcm drm snd_timer snd intel_agp i2c_algo_bit shpchp intel_gtt soundcore button acpi_cpufreq processor sch_fq_codel nfsd nfs auth_rpcgss oid_registry nfs_acl lockd grace sunrpc fscache ip_tables x_tables ext4 crc16 mbcache jbd2 sd_mod ata_generic pata_acpi atkbd libps2 ahci libahci ata_piix pata_jmicron uhci_hcd usbcore libata usb_common scsi_mod i8042 serio
[315991.664823] CPU: 2 PID: 257 Comm: md0_raid5 Not tainted 4.2.5-1-ARCH #1
[315991.664829] Hardware name:      /NM10, BIOS 080016  07/19/2011
[315991.664834]  0000000000000000 000000008bdf44d3 ffff8800b667ba98 ffffffff81570d0a
[315991.664844]  0000000000000000 0000000000000000 ffff8800b667bad8 ffffffff810748a6
[315991.664852]  ffff8800b667bb38 ffff8800461599a0 ffff8800b66e5348 ffff880082b74ce0
[315991.664860] Call Trace:
[315991.664876]  [<ffffffff81570d0a>] dump_stack+0x4c/0x6e
[315991.664887]  [<ffffffff810748a6>] warn_slowpath_common+0x86/0xc0
[315991.664895]  [<ffffffff810749da>] warn_slowpath_null+0x1a/0x20
[315991.664907]  [<ffffffffa05fcf39>] break_stripe_batch_list+0x1b9/0x260 [raid456]
[315991.664919]  [<ffffffffa060424c>] handle_stripe+0x9bc/0x2560 [raid456]
[315991.664930]  [<ffffffff812b4c6f>] ? cpumask_next_and+0x2f/0x40
[315991.664943]  [<ffffffffa0605f73>] handle_active_stripes.isra.22+0x183/0x4c0 [raid456]
[315991.664957]  [<ffffffffa060687f>] raid5d+0x49f/0x670 [raid456]
[315991.664973]  [<ffffffffa04550a0>] md_thread+0x130/0x140 [md_mod]
[315991.664984]  [<ffffffff810b4c80>] ? wake_atomic_t_function+0x60/0x60
[315991.664996]  [<ffffffffa0454f70>] ? md_wait_for_blocked_rdev+0x130/0x130 [md_mod]
[315991.665007]  [<ffffffff81092578>] kthread+0xd8/0xf0
[315991.665017]  [<ffffffff810924a0>] ? kthread_worker_fn+0x170/0x170
[315991.665028]  [<ffffffff8157665f>] ret_from_fork+0x3f/0x70
[315991.665038]  [<ffffffff810924a0>] ? kthread_worker_fn+0x170/0x170
[315991.665046] ---[ end trace 6558f4ff075bfa8c ]---


Here is some info about the raid (currently resyncing after having had to reboot):

$ mdadm -D /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Fri Feb 13 22:16:46 2009
     Raid Level : raid5
     Array Size : 8790799296 (8383.56 GiB 9001.78 GB)
  Used Dev Size : -1
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Dec  1 19:11:48 2015
          State : active, resyncing
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

  Resync Status : 14% complete

           UUID : 4a529d12:38740437:6b618a7b:b902ea28
         Events : 0.142106

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       65        1      active sync   /dev/sde1
       2       8       49        2      active sync   /dev/sdd1
       3       8       81        3      active sync   /dev/sdf1
Comment 1 Neil Brown 2015-12-17 05:38:11 UTC
>   Used Dev Size : -1

that doesn't look good.

>        Version : 0.90
>     Array Size : 8790799296

that shouldn't happen.
v0.90 metadata can only handle component devices up to 2TB.
These look like 3TB.
But you said they were 2TB... that should provide an array size of about 6TB,
not 8.
(You should really be using 1.0 or 1.2 metadata)
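
(Editorial aside, working through that arithmetic: with 4 devices in RAID 5 there are 3 data-bearing devices, so the reported Array Size of 8790799296 KiB implies 8790799296 / 3 = 2930266432 KiB, roughly 2.73 TiB (about 3.0 TB) per component - i.e. roughly 3 TB disks - whereas 4 x 2 TB disks would give about 3 x 2 TB = 6 TB.)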

However, that is probably tangential.  There are other reports of the same problem, such as https://bugzilla.redhat.com/show_bug.cgi?id=1258153,
so it probably isn't size related.

The WARNING is triggered because one of 14 different bit flags is set in sh->state.
It would really help to know which one....  Can you reproduce the problem easily?  Could you modify break_stripe_batch_list() to something like
  if (WARN_ON_ONCE(....)) printk("sh->state is %x\n", sh->state);
??
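
(Editorial note: a minimal sketch of what that instrumentation could look like in break_stripe_batch_list() in drivers/md/raid5.c, with the flag mask abbreviated; sh->state is an unsigned long, so %lx is the matching format, as in the variant Tom Weber applies in comment 18.)

  if (WARN_ON_ONCE(sh->state & ((1 << STRIPE_ACTIVE) |
                                (1 << STRIPE_SYNCING) |
                                /* ... remaining flags from the existing mask ... */
                                (1 << STRIPE_BITMAP_PENDING))))
          printk("XXX sh->state is %lx\n", sh->state);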
Comment 2 Michael Lipp 2015-12-30 09:57:22 UTC
I experience the same problem with a 4.2.0-22-generic (Ubuntu) kernel and a raid6, after upgrading from kernel 3.8.0-35.

Dec 29 22:53:15 olymp kernel: [31880.799771] ------------[ cut here ]------------
Dec 29 22:53:15 olymp kernel: [31880.799781] WARNING: CPU: 0 PID: 229 at /build/linux-cRemOf/linux-4.2.0/drivers/md/raid5.c:4246 break_stripe_batch_list+0x1c2/0x270 [raid456]()
Dec 29 22:53:15 olymp kernel: [31880.799782] Modules linked in: vhost_net vhost macvtap macvlan ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables binfmt_misc bridge stp llc intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp input_leds joydev kvm_intel kvm crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul ie31200_edac edac_core shpchp glue_helper lpc_ich 8250_fintek ablk_helper cryptd mac_hid ipmi_ssif tpm_crb ipmi_si ipmi_devintf ipmi_msghandler jc42 coretemp parport_pc ppdev lp parport autofs4 raid10 raid0 multipath linear dm_cache_mq dm_cache dm_persistent_data dm_bio_prison dm_bufio libcrc32c raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx hid_generic xor raid6_pq raid1 usbhid hid syscopyarea sysfillrect sysimgblt ttm igb drm_kms_helper e1000e drm dca ptp i2c_algo_bit ahci pps_core libahci video
Dec 29 22:53:15 olymp kernel: [31880.799830] CPU: 0 PID: 229 Comm: md1_raid6 Not tainted 4.2.0-22-generic #27-Ubuntu
Dec 29 22:53:15 olymp kernel: [31880.799831] Hardware name: Supermicro X10SLL-F/X10SLL-F, BIOS 3.0 04/24/2015
Dec 29 22:53:15 olymp kernel: [31880.799833]  0000000000000000 000000000dcbe27f ffff8807f3f77a98 ffffffff817e94c9
Dec 29 22:53:15 olymp kernel: [31880.799835]  0000000000000000 0000000000000000 ffff8807f3f77ad8 ffffffff8107b3d6
Dec 29 22:53:15 olymp kernel: [31880.799837]  0000000000000001 ffff8807f3f539a8 ffff880644d54678 ffff8805613606b0
Dec 29 22:53:15 olymp kernel: [31880.799839] Call Trace:
Dec 29 22:53:15 olymp kernel: [31880.799845]  [<ffffffff817e94c9>] dump_stack+0x45/0x57
Dec 29 22:53:15 olymp kernel: [31880.799859]  [<ffffffff8107b3d6>] warn_slowpath_common+0x86/0xc0
Dec 29 22:53:15 olymp kernel: [31880.799861]  [<ffffffff8107b50a>] warn_slowpath_null+0x1a/0x20
Dec 29 22:53:15 olymp kernel: [31880.799864]  [<ffffffffc0200712>] break_stripe_batch_list+0x1c2/0x270 [raid456]
Dec 29 22:53:15 olymp kernel: [31880.799867]  [<ffffffffc02077d1>] handle_stripe+0x9c1/0x2550 [raid456]
Dec 29 22:53:15 olymp kernel: [31880.799871]  [<ffffffff810ae739>] ? update_curr+0x119/0x150
Dec 29 22:53:15 olymp kernel: [31880.799874]  [<ffffffffc02094f8>] handle_active_stripes.isra.46+0x198/0x4a0 [raid456]
Dec 29 22:53:15 olymp kernel: [31880.799876]  [<ffffffffc01fc8d8>] ? do_release_stripe+0xd8/0x190 [raid456]
Dec 29 22:53:15 olymp kernel: [31880.799879]  [<ffffffffc0209c90>] raid5d+0x490/0x670 [raid456]
Dec 29 22:53:15 olymp kernel: [31880.799882]  [<ffffffff817ebc4c>] ? __schedule+0x36c/0x950
Dec 29 22:53:15 olymp kernel: [31880.799885]  [<ffffffff8165ac69>] md_thread+0x139/0x150
Dec 29 22:53:15 olymp kernel: [31880.799887]  [<ffffffff810bd190>] ? wake_atomic_t_function+0x60/0x60
Dec 29 22:53:15 olymp kernel: [31880.799889]  [<ffffffff8165ab30>] ? find_pers+0x80/0x80
Dec 29 22:53:15 olymp kernel: [31880.799892]  [<ffffffff8109a868>] kthread+0xd8/0xf0
Dec 29 22:53:15 olymp kernel: [31880.799894]  [<ffffffff8109a790>] ? kthread_create_on_node+0x1f0/0x1f0
Dec 29 22:53:15 olymp kernel: [31880.799896]  [<ffffffff817f06df>] ret_from_fork+0x3f/0x70
Dec 29 22:53:15 olymp kernel: [31880.799898]  [<ffffffff8109a790>] ? kthread_create_on_node+0x1f0/0x1f0
Dec 29 22:53:15 olymp kernel: [31880.799900] ---[ end trace 560f8a7f63c8e10b ]---

root@olymp:~# mdadm -D /dev/md1
/dev/md1:
        Version : 0.90
  Creation Time : Fri Oct 16 13:45:25 2009
     Raid Level : raid6
     Array Size : 1904296704 (1816.08 GiB 1950.00 GB)
  Used Dev Size : 952148352 (908.04 GiB 975.00 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Wed Dec 30 10:52:14 2015
          State : active, resyncing 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

  Resync Status : 0% complete

           UUID : xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
         Events : 0.5217958

    Number   Major   Minor   RaidDevice State
       0       8       37        0      active sync   /dev/sdc5
       1       8       21        1      active sync   /dev/sdb5
       2       8        5        2      active sync   /dev/sda5
       3       8       53        3      active sync   /dev/sdd5
Comment 3 Michael Lipp 2015-12-30 10:40:22 UTC
Sorry, I looked it up wrong. I upgraded from either 3.16.0-40 or 3.19.0-42. Trying to find out...
Comment 4 Mattias Månsson 2015-12-30 11:20:09 UTC
I have rebuilt my raid with correct alignment and metadata 1.2 and I have no problems at all any more. It could still be a bug in that kernel version, but at least everything seems stable now that my setup has been fixed.
Comment 5 Michael Lipp 2015-12-31 11:40:17 UTC
Gone back to 3.19.0-42, everything seems to work fine again. Moving to metadata 1.2 is not an (easy) option for me. I'd consider it only if the problem can really be related to the metadata format.
Comment 6 Shaohua Li 2016-01-21 05:12:53 UTC
Could anybody try to get the 'sh->state' when this happens, as Neil suggested? That would be very helpful for debugging this issue.
Comment 7 Mattias Månsson 2016-01-21 07:52:48 UTC
"Unfortunately" it never happens to me after rebuilding, so I can't do it...
Comment 8 Michael Lipp 2016-01-21 09:19:46 UTC
I feel a bit bad about this because I know that you need the information if you cannot find the cause by comparing the source of the kernel versions. But honestly, I consider myself very lucky that my data has survived two such hangs (at least apparently, though there were some error messages when the raid was resyncing). So I'm not really daring enough to reboot with that problematic kernel.
Comment 9 Michael Lipp 2016-02-04 10:14:12 UTC
I spent last week moving my server data to a rebuilt raid6 with metadata 1.2 (still using the 3.19.0 kernel). Everything went fine (it was probably a good idea to rebuild everything with a better chunk size and new ext4 filesystems, despite the effort it took).

/dev/md2:
        Version : 1.2
  Creation Time : Thu Jan 28 12:04:12 2016
     Raid Level : raid6
     Array Size : 1919296512 (1830.38 GiB 1965.36 GB)
  Used Dev Size : 959648256 (915.19 GiB 982.68 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Thu Feb  4 10:56:16 2016
          State : active 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : xxxxxxxxxxxxxxxxxxx
           UUID : xxxxxxxxxxxxxxxxxxx
         Events : 116020

    Number   Major   Minor   RaidDevice State
       6       8       34        0      active sync   /dev/sdc2
       7       8       50        1      active sync   /dev/sdd2
       4       8        2        2      active sync   /dev/sda2
       5       8       18        3      active sync   /dev/sdb2


Yesterday evening I ventured to update the kernel to (the current Ubuntu) 4.2.0-27. I shouldn't have done that. This morning the server hung again. Obviously quite horribly, because between the breakdown and the reboot there's just a block of "^@^@^@..." in the syslog. So, of course, I cannot really claim that I experienced the same problem as with my 0.9-metadata raid6. But I'd say it's quite probable.
Comment 10 Michael Lipp 2016-02-09 09:32:15 UTC
Since moving to metadata 1.2 didn't fix the problem on my server, and sticking with the 3.19.0-42 kernel forever isn't really an option, I tried 4.4.1 (for Ubuntu wily). The server has now been running for two days without any problems. So from my point of view, this can be closed.
Comment 11 Shaohua Li 2016-02-09 17:52:03 UTC
Did a resync run in 4.4.1? From the original reports, the problem seems to only happen while a resync is running.
Comment 12 Michael Lipp 2016-02-11 17:13:16 UTC
I don't think that I had a resync running in 4.4.1 yet. Should I trigger a check or repair? Or should I really fail a drive?
Comment 13 Shaohua Li 2016-02-11 18:57:05 UTC
Maybe a repair is enough. If we knew how to trigger the bug, it would be easy to fix.
Comment 14 Michael Lipp 2016-02-14 13:14:09 UTC
I did a repair. Got three rebuild events (which I find surprisingly high). But otherwise nothing happened (luckily). This confirms that with metadata 1.2 and kernel 4.4.1 I'm on the safe side, doesn't it? Or are there other bug reports for that combination?

Mattias (above), who hasn't had problems any more since he started using metadata 1.2, never mentioned the version of the kernel he's using. But as he wrote that he'd switched to Arch Linux, it should be a rather new kernel.

So from my point of view, I still think this can be closed.
Comment 15 Tom Weber 2016-03-01 10:24:44 UTC
Today I had this with 4.4.2.
Last week with 4.4.1.

/dev/md126:
        Version : 1.2
  Creation Time : Fri Jan 16 23:56:25 2015
     Raid Level : raid5
     Array Size : 5860270080 (5588.79 GiB 6000.92 GB)
  Used Dev Size : 2930135040 (2794.39 GiB 3000.46 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Tue Mar  1 11:12:31 2016
          State : clean 
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : XXXX:store0  (local to host XXXX)
           UUID : XXXX
         Events : 37121

    Number   Major   Minor   RaidDevice State
       0       8       64        0      active sync   /dev/sde
       1       8       80        1      active sync   /dev/sdf
       3       8       96        2      active sync   /dev/sdg

The system was running fine with 3.18.5 for more than a year (Debian 7).
After upgrading to Debian 8 and 4.4.1, this happened two days in a row last week.
After working for a week with 4.4.2, it has now failed again.

(The system is a regular Debian install, except for self-compiled vanilla kernels.)

It's a system in production, so I'm limited regarding tests, but I will try to implement the above printk tonight.

But it's definitely not fixed in 4.4.2.
Comment 16 Tom Weber 2016-03-03 20:10:45 UTC
There it happened again, this time with the output of sh->state.

Note that I was NOT able to trigger it intentionally. I was running big copies and something like 10 bonnie++ runs in parallel without problems... now, 2 days later, it just happened:


[190314.621828] WARNING: CPU: 5 PID: 377 at drivers/md/raid5.c:4251 break_stripe_batch_list+0x264/0x280 [raid456]()
[190314.623268] Modules linked in: vhost_net vhost macvtap macvlan af_packet tun binfmt_misc cpufreq_powersave cpufreq_performance cpufreq_conservative cpufreq_ondemand cpufreq_stats bridge i7core_edac fuse 8021q garp stp llc coretemp intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ipmi_ssif hwmon_vid ext2 iTCO_wdt iTCO_vendor_support sha256_ssse3 sha256_generic fbcon bitblit softcursor ast font tileblit ttm drbg drm_kms_helper ansi_cprng drm fb_sys_fops syscopyarea sysfillrect sysimgblt cfbfillrect cfbimgblt cfbcopyarea fb fbdev aesni_intel aes_x86_64 lrw gf128mul sb_edac glue_helper pcspkr ablk_helper evdev input_leds edac_core cryptd i2c_i801 mei_me ioatdma mei shpchp lpc_ich wmi rtc_cmos ipmi_si 8250_fintek ipmi_msghandler tpm_tis tpm
[190314.632214]  acpi_power_meter acpi_pad button processor loop drbd lru_cache autofs4 ext4 crc16 mbcache jbd2 dm_mod raid1 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic md_mod hid_generic usbhid hid sg sd_mod ehci_pci crc32c_intel igb ahci i2c_algo_bit libahci dca hwmon xhci_pci ptp ehci_hcd libata xhci_hcd pps_core i2c_core fjes
[190314.638185] CPU: 5 PID: 377 Comm: md126_raid5 Not tainted 4.4.3 #2
[190314.639614] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS 1.0b 01/29/2015
[190314.641066]  0000000000000000 ffff880858fcfb70 ffffffff8126ebfe 0000000000000000
[190314.642559]  ffffffffa03b7ccc ffff880858fcfba8 ffffffff81054ff6 ffff88030dd8fa58
[190314.644007]  ffff88001699c5b8 ffff8803defb6530 ffff8807a903ca58 0000000000000000
[190314.645421] Call Trace:
[190314.646782]  [<ffffffff8126ebfe>] dump_stack+0x63/0x85
[190314.648118]  [<ffffffff81054ff6>] warn_slowpath_common+0x86/0xc0
[190314.649452]  [<ffffffff810550ea>] warn_slowpath_null+0x1a/0x20
[190314.650781]  [<ffffffffa03ad784>] break_stripe_batch_list+0x264/0x280 [raid456]
[190314.652125]  [<ffffffffa03b1ee7>] handle_stripe+0x5f7/0x23d0 [raid456]
[190314.653468]  [<ffffffff810856cf>] ? update_curr+0xbf/0x150
[190314.654825]  [<ffffffffa03b3e3b>] handle_active_stripes.isra.44+0x17b/0x400 [raid456]
[190314.656221]  [<ffffffffa03b4426>] raid5d+0x366/0x670 [raid456]
[190314.657631]  [<ffffffff810b5a80>] ? trace_event_raw_event_tick_stop+0x110/0x110
[190314.659069]  [<ffffffffa0346b5b>] md_thread+0x12b/0x130 [md_mod]
[190314.660521]  [<ffffffff81092c10>] ? wait_woken+0x90/0x90
[190314.661985]  [<ffffffffa0346a30>] ? md_wait_for_blocked_rdev+0x100/0x100 [md_mod]
[190314.663484]  [<ffffffff810712cb>] kthread+0xdb/0x100
[190314.664995]  [<ffffffff810711f0>] ? kthread_create_on_node+0x170/0x170
[190314.666533]  [<ffffffff814cbe4f>] ret_from_fork+0x3f/0x70
[190314.668064]  [<ffffffff810711f0>] ? kthread_create_on_node+0x170/0x170
[190314.669621] ---[ end trace 2b6a66b632d9b3a2 ]---
[190314.671287] XXX sh->state is 2042
Comment 17 Neil Brown 2016-03-03 22:17:20 UTC
> XXX sh->state is 2042

Assuming that is hex, the bits are
STRIPE_HANDLE STRIPE_PREREAD_ACTIVE STRIPE_IO_STARTED
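
(Editorial note, a worked decode assuming the flag positions in the drivers/md/raid5.h enum of that era - STRIPE_HANDLE = 1, STRIPE_PREREAD_ACTIVE = 6, STRIPE_IO_STARTED = 13:

  0x2042 = (1 << 1) | (1 << 6) | (1 << 13)
         = (1 << STRIPE_HANDLE) | (1 << STRIPE_PREREAD_ACTIVE) | (1 << STRIPE_IO_STARTED))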

STRIPE_PREREAD_ACTIVE is the problem.

That should help focus the bug-search a little - I'll see what I can find.

Thanks.
Comment 18 Tom Weber 2016-03-03 22:52:57 UTC
it is:
[..]
                      (1 << STRIPE_BITMAP_PENDING))) ) printk("XXX sh->state is %lx\n", sh->state);
[..]

Oh, and this was on a different machine than the one whose md dump I posted in Comment #15.

Strangely enough, I have 2 servers that both crashed 2-3 times during the last 10 days, and 3 other servers with a very similar setup and the same kernels (4.4.x) that seem to be running fine (at least so far).
Comment 19 Neil Brown 2016-03-03 23:03:20 UTC
Yes... it is obvious now.

Whenever STRIPE_PREREAD_ACTIVE is set on a stripe, conf->preread_active_stripes is incremented.  When the flag is cleared, it is decremented.

Partial-stripe writes are delayed until there are no preread stripes active.
So if we ever clear STRIPE_PREREAD_ACTIVE without decrementing the counter, the counter will never reach zero and all partial-stripe writes will block forever.

If STRIPE_PREREAD_ACTIVE is set in break_stripe_batch_list, then the set_mask_bits call will clear it without the decrement - hence the warning is followed by a lockup.

When a stripe is added to a batch, the preread_active flag is checked and cleared with the appropriate decrement.  And STRIPE_PREREAD_ACTIVE is never set on a stripe in a batch.  But there is no locking to avoid races here.
So one thread could
A/ check that the stripe is not in a batch
B/ set the flag

and between A and B another thread could add it to a batch and check the flag.
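
(Editorial sketch of the interleaving described above; the steps are paraphrased from this comment rather than taken from the exact call sites in raid5.c:

  Thread A                                        Thread B
  --------                                        --------
  A1: checks that sh is not in a batch
                                                  B1: adds sh to a batch
                                                  B2: tests STRIPE_PREREAD_ACTIVE -
                                                      still clear, nothing to decrement
  A2: set_bit(STRIPE_PREREAD_ACTIVE, &sh->state)
  A3: atomic_inc(&conf->preread_active_stripes)

Result: a stripe in a batch carries STRIPE_PREREAD_ACTIVE; break_stripe_batch_list()
later wipes it via set_mask_bits() without the matching decrement, so
conf->preread_active_stripes never reaches zero and delayed partial-stripe writes
hang.)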

The easiest fix would be to preserve the PREREAD_ACTIVE flag in the set_mask_bits() as the patch below should do.
It might be good to catch the imbalance earlier, but I'm not sure it is important.

So: please apply this fix (it just moves a single line down) and let us know if it still hangs.  Maybe you could even leave the WARN_ON_ONCE alone so you still get the warning - then if the array continues to function after the warning is given, you can be certain that you lost the race but it didn't cause any problems.


diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 704ef7fcfbf8..e3e713b5e766 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4238,7 +4238,6 @@ static void break_stripe_batch_list(struct stripe_head *head_sh,
 		WARN_ON_ONCE(sh->state & ((1 << STRIPE_ACTIVE) |
 					  (1 << STRIPE_SYNCING) |
 					  (1 << STRIPE_REPLACED) |
-					  (1 << STRIPE_PREREAD_ACTIVE) |
 					  (1 << STRIPE_DELAYED) |
 					  (1 << STRIPE_BIT_DELAY) |
 					  (1 << STRIPE_FULL_WRITE) |
@@ -4253,6 +4252,7 @@ static void break_stripe_batch_list(struct stripe_head *head_sh,
 					      (1 << STRIPE_REPLACED)));
 
 		set_mask_bits(&sh->state, ~(STRIPE_EXPAND_SYNC_FLAGS |
+					    (1 << STRIPE_PREREAD_ACTIVE) |
 					    (1 << STRIPE_DEGRADED)),
 			      head_sh->state & (1 << STRIPE_INSYNC));
Comment 20 Tom Weber 2016-03-03 23:58:16 UTC
Great.
I've applied the set_mask_bits part now, leaving the WARN_ON_ONCE alone, so we can hopefully see a warning without the machine grinding to a halt, to confirm it's working.

Booting the kernel on the two problematic servers now.

If you have any idea how I could try to trigger this, let me know.

And many thanks for the quick response!
Comment 21 Michael Lipp 2016-03-04 10:00:16 UTC
Thanks to Tom for doing the research and risking his RAIDs -- as I said, the problem didn't show up any more on my RAID.

I'm wondering... Although this problem doesn't seem to have affected a lot of users (for whatever reason), a RAID that suddenly stops seems like a high priority problem to me (it certainly *felt* like one when it happened to my server). The crucial information for analysing this problem came from inserting a single printk that turned a bare warning message into a warning message with the reason for the warning.

I'm not doing kernel development, just "ordinary" software development. But I'd certainly make sure I kept that printk statement in my code forever now (actually, I'd have put it there in the first place), in order never to run into a situation again where I can only say "well, it is caused by one of 14 flags, I first need to know which one for further analysis".

So, when finally providing a patch for this, wouldn't it make sense to add that printk as well? To be prepared for future problems (though hopefully they never occur).
Comment 22 Neil Brown 2016-03-05 21:50:18 UTC
> wouldn't it make sense to add that printk as well? 

Maybe.

From my perspective, the crucial information to analyse the problem was the WARN_ON message.  Without that we would have been completely in the dark and probably would have had to ask for a git-bisect - which would have taken ages.
With the WARN_ON we knew a lot more and even had a list of bits, one of which was causing a problem.
Had I the time and motivation I probably could have examined those bits one at a time and determined if any could have been causing the problem.  Examining code based on vague maybes like that is difficult and time consuming, but it is possible.

Having the result of the printk certainly helped a lot - it cut the amount of work required significantly.  But we only knew that might be a good idea because we already had the WARN_ON.

I'm not against making that printk permanent - init_stripe() has similar code which checks some fields then prints them all out if anything is wrong.

But if the goal is improving debuggability, then I'm not sure that adding the printk would be my first step.  I would probably find a way to make it easy to extract all the data structures - e.g. the raid5conf and the table of stripes.
In this case, that information would have shown that ->preread_active_stripes was not zero, but that none of the stripes in the table had the STRIPE_PREREAD_ACTIVE flag set.  I suspect that would have quickly led to understanding.

So: maybe.  Thanks for the suggestion.
Comment 23 Michael Lipp 2016-03-06 13:36:52 UTC
It seems that the problem struck my server again after all. I cannot be sure, because the end of the journal was lost (I'm still wondering why this problem affects so few people). Anyway, I've installed a 4.4.4 kernel with the set_mask_bits patch applied now as well. I'll be looking in the days to come for the warning (hopefully without the server hanging).
Comment 24 Neil Brown 2016-03-06 21:52:40 UTC
> I'm still wondering why this problem affects so few people

That is a puzzle.  Different filesystems could trigger the bug to different extents, but the only filesystem mentioned here is ext4 and that is probably quite common.  Does anyone use anything else?

It is possible that "REQ_SYNC" requests are more likely to trigger the bug, and they are probably used for writing the journal (hence the observation that the end of the journal was lost).   Maybe the problem affects data=journal more than data=ordered.  Is anyone using data=journal on the ext4?
Comment 25 Michael Lipp 2016-03-06 22:14:52 UTC
I use my raid6 as lvm2 physical volume. Some of the logical volumes are used as "raw" disk devices for virtual machines. On those disks are -- apart from ext4 file systems -- swap partitions and NTFS file systems (used by the Windows virtual machines). Actually, when the server stopped last night, one of the Windows machines was executing an update -- one of the rare occasions when there is significant disk I/O load on the NTFS file systems.
Comment 26 Tom Weber 2016-03-07 09:48:22 UTC
First of all, I can confirm that the patch works! :)

I've seen the WARN_ON_ONCE on one of the patched machines today, but with no negative impact.

My usual setup is mdraid - LVM - DRBD - ext4 or raw for KVM.
The crashes appeared on ext4 on top of this (one box hosting KVM image files, the other running KVMs and various LXCs).
I was in fact using data=journal on one of the filesystems when this first happened, but turning it off made no difference. The other crashing box didn't have it at all.

Note that this is a very common setup for me and I have at least 10 machines with kernels > 4.1.x in similar setups that are running fine for weeks/months now.

And as I wrote, I could not trigger the bug by hammering the IO system either. Very strange.
Comment 27 Shaohua Li 2016-03-07 17:55:38 UTC
Awesome! Neil, please make a formal patch :)

Regarding the issue, the race window is tiny. From my understanding, two tasks are writing to the same stripe; each writes a partial stripe, and the two writes together make up a full stripe write. The first write runs first, but it leaves raid5_make_request after the second write runs.
Comment 28 Neil Brown 2016-03-09 02:14:38 UTC
Done.

http://thread.gmane.org/87r3fkjttq.fsf@notabene.neil.brown.name

Thanks everyone!