Bug 15464 - kernel BUG at ...drivers/md/raid5.c:2768 after harddisk in raid6-array failed
Status: RESOLVED UNREPRODUCIBLE
Product: IO/Storage
Classification: Unclassified
Component: MD
Hardware: All
OS: Linux
Importance: P1 normal
Assigned To: Neil Brown
Reported: 2010-03-07 09:05 UTC by kernelbugtracker
Modified: 2012-07-05 15:50 UTC
2 users

See Also:
Kernel Version: 2.6.32 x86_64
Tree: Mainline
Regression: No


Attachments
Kernel log of system startup and the event that triggered the kernel bug message (31.28 KB, application/octet-stream)
2010-03-07 09:05 UTC, kernelbugtracker

Description kernelbugtracker 2010-03-07 09:05:06 UTC
Created attachment 25392 [details]
Kernel log of system startup and the event that triggered the kernel bug message

kernel BUG at /build/buildd-linux-2.6_2.6.32-5-amd64-9RvY2G/linux-2.6-2.6.32/debian/build/source_amd64_none/drivers/md/raid5.c:2768

A hard disk failed during a scheduled array check. This kernel bug followed.

Current status according to /proc/mdstat:
md2 : active raid6 sdf[0] sdj[4] sdi[5](F) sdh[2] sdg[1]
      4395415488 blocks level 6, 64k chunk, algorithm 2 [5/4] [UUU_U]
      [====>................]  check = 20.0% (293219624/1465138496) finish=1104648.9min speed=17K/sec

The filesystem on the array is still mounted, but files are inaccessible (the accessing process hangs waiting for I/O).
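The degradation visible in the `/proc/mdstat` lines above (`(F)` failed markers, the `[5/4]` active count, and the `[UUU_U]` slot map) can be decoded programmatically. A minimal sketch, assuming the mdstat layout shown in this report (the function and sample names are illustrative, not part of any standard tool):

```python
import re

def parse_mdstat(lines):
    """Parse a /proc/mdstat entry into failed devices and degradation info."""
    info = {}
    # Device list: name[slot], with an optional (F) failed marker.
    info["failed"] = [m.group(1) for m in
                      re.finditer(r"(\w+)\[\d+\]\(F\)", lines[0])]
    # "[5/4]" = 5 configured devices, 4 active; "[UUU_U]" maps the slots.
    m = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", lines[1])
    total, active, slots = int(m.group(1)), int(m.group(2)), m.group(3)
    info["degraded"] = active < total
    info["down_slots"] = [i for i, c in enumerate(slots) if c == "_"]
    return info

sample = [
    "md2 : active raid6 sdf[0] sdj[4] sdi[5](F) sdh[2] sdg[1]",
    "      4395415488 blocks level 6, 64k chunk, algorithm 2 [5/4] [UUU_U]",
]
print(parse_mdstat(sample))
# {'failed': ['sdi'], 'degraded': True, 'down_slots': [3]}
```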

Console broadcast:
[64736.822991] invalid opcode: 0000 [#1] SMP
[64736.823022] last sysfs file: /sys/devices/virtual/block/md0/md/mismatch_cnt
[64736.825867] Stack:
[64736.826056] Call Trace:
[64736.826415] Code: e9 ab 01 00 00 83 7c 24 3c 02 74 04 0f 0b eb fe f6 43 40 10 c7 43 58 00 00 00 00 0f 85 8f 01 00 00 41 8d 45 ff 39 44 24 30 7d 04 <0f> 0b eb fe 83 7c 24 3c 02 75 20 48 63 44 24 6c ff 44 24 2c 48
Comment 1 Neil Brown 2010-04-15 03:22:13 UTC
Thanks for the report.
Presumably you rebooted and have access to your data again?

I have been staring at the code for an hour or more and cannot
see how this could possibly happen.

Could you clarify one thing for me though?

Was the array already degraded while the 'check' was running and
a second device failed, or was it optimal and this was the first failure?

Either way, I cannot see the BUG happening, but knowing might help
narrow the search space.

Thanks.
Comment 2 kernelbugtracker 2010-04-15 04:26:27 UTC
The array was intact/optimal/active with all 5 drives when it started checking (through the cron-scheduled execution of '/usr/share/mdadm/checkarray --cron --all --quiet').
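The cron-driven checkarray script ultimately triggers the check through the md sysfs interface (writing "check" to the array's sync_action file, i.e. `echo check > /sys/block/md2/md/sync_action`). A minimal sketch of the equivalent manual trigger; the function name is illustrative, and the `sysfs_root` parameter is only there so the path can be redirected for testing:

```python
import os

def start_check(md="md2", sysfs_root="/sys/block"):
    """Kick off a manual data check on an md array via sysfs."""
    path = os.path.join(sysfs_root, md, "md", "sync_action")
    with open(path, "w") as f:
        f.write("check")
    return path
```

Running this against the real sysfs path requires root, and the check progress then appears in /proc/mdstat exactly as pasted above.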

The status in /proc/mdstat remained the same after the kernel bug happened, except for the speed indicator, which slowly dropped over time since the check didn't proceed any further (it was stuck at 293219624).

After a reboot, the array became available again (in degraded mode) and the data was accessible.
I've since replaced the defective drive, rebuilt the array, and all is well.

The 5 drives are connected to a 1-to-5 port multiplier; maybe that is acting up when one of the drives fails.
Comment 3 kernelbugtracker 2011-02-06 13:10:15 UTC
And here it is again. This time with kernel version:
Linux 2.6.32-5-amd64 #1 SMP Fri Dec 10 15:35:08 UTC 2010 x86_64 GNU/Linux.
The hardware used to access the drives has changed: instead of port multipliers, they are now all attached to this controller (from lspci):
01:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2116 PCI-Express Fusion-MPT SAS-2 [Meteor] (rev 02)

The situation is almost the same as when I started this bug report: a hard drive fails and is removed from the array during a scheduled check. This time the failed array is configured as RAID-6. I still can't access the data (trying to access it freezes the shell).

Also, even though it says the check is done for md2, /proc/mdstat contains:
md2 : active raid6 sdh[0] sdt[5](F) sds[3] sdp[2] sdl[1]
      4395415488 blocks level 6, 64k chunk, algorithm 2 [5/4] [UUUU_]
      [=======>.............]  check = 36.6% (537332320/1465138496) finish=464198.9min speed=33K/sec
This is possibly caused by the fact that the array is inaccessible and any attempt to access it blocks the accessing process forever.
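A stall like this can be detected from the outside by sampling the check position in /proc/mdstat twice and seeing that it never advances (here it was stuck at 537332320). A minimal sketch under that assumption; the function names and the embedded sample are illustrative:

```python
import re
import time

MDSTAT_SAMPLE = """\
md2 : active raid6 sdh[0] sdt[5](F) sds[3] sdp[2] sdl[1]
      4395415488 blocks level 6, 64k chunk, algorithm 2 [5/4] [UUUU_]
      [=======>.............]  check = 36.6% (537332320/1465138496) finish=464198.9min speed=33K/sec
"""

def sync_position(mdstat_text, dev="md2"):
    """Return the current check/resync position for dev, or None."""
    m = re.search(dev + r"\b.*?check =\s*[\d.]+% \((\d+)/(\d+)\)",
                  mdstat_text, re.S)
    return int(m.group(1)) if m else None

def check_stalled(read_mdstat, dev="md2", interval=60):
    """True if the check position does not advance within interval seconds."""
    before = sync_position(read_mdstat(), dev)
    time.sleep(interval)
    after = sync_position(read_mdstat(), dev)
    return before is not None and before == after

print(sync_position(MDSTAT_SAMPLE))  # 537332320
```

In real use, `read_mdstat` would be `lambda: open("/proc/mdstat").read()` and the interval would be long enough for a healthy check to make visible progress.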

Excerpt from dmesg:
[2872063.176884] md: data-check of RAID array md2
[2872063.176887] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[2872063.176890] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[2872063.176897] md: using 128k window, over a total of 1465138496 blocks.
[2884700.024731] sd 2:0:14:0: [sdt] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[2884700.024737] sd 2:0:14:0: [sdt] Sense Key : Aborted Command [current]
[2884700.024742] sd 2:0:14:0: [sdt] Add. Sense: Information unit iuCRC error detected
[2884700.024749] sd 2:0:14:0: [sdt] CDB: Read(10): 28 00 40 0e 11 00 00 04 00 00
[2884700.024757] end_request: I/O error, dev sdt, sector 1074663680
[2884700.095972] sd 2:0:14:0: [sdt] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[2884700.095978] sd 2:0:14:0: [sdt] Sense Key : Aborted Command [current]
[2884700.095983] sd 2:0:14:0: [sdt] Add. Sense: Information unit iuCRC error detected
[2884700.095991] sd 2:0:14:0: [sdt] CDB: Read(10): 28 00 40 0e 15 00 00 04 00 00
[2884700.096001] end_request: I/O error, dev sdt, sector 1074664704
[2884700.102563] __ratelimit: 51 callbacks suppressed
[2884700.102569] raid5:md2: read error corrected (8 sectors at 1074663680 on sdt)
[  3 more of those (in different sectors)  ]
[2884700.119336] mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[  12 more of those  ]
[2884700.160212] raid5:md2: read error corrected (8 sectors at 1074664736 on sdt)
[  5 more of those (in different sectors)  ]
[2884700.167458] mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[  11 more of those  ]
[2884700.222720] sd 2:0:14:0: [sdt] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[2884700.222724] sd 2:0:14:0: [sdt] Sense Key : Aborted Command [current]
[2884700.222728] sd 2:0:14:0: [sdt] Add. Sense: Information unit iuCRC error detected
[2884700.222734] sd 2:0:14:0: [sdt] CDB: Read(10): 28 00 40 0e 11 20 00 04 00 00
[2884700.222743] end_request: I/O error, dev sdt, sector 1074663712
[2884700.222801] raid5: Disk failure on sdt, disabling device.
[2884700.222802] raid5: Operation continuing on 4 devices.
[2884700.271647] ------------[ cut here ]------------
[2884700.271711] kernel BUG at /build/buildd-linux-2.6_2.6.32-29-amd64-xcs37n/linux-2.6-2.6.32/debian/build/source_amd64_none/drivers/md/raid5.c:2764!
[2884700.271812] invalid opcode: 0000 [#1] SMP
[2884700.271909] last sysfs file: /sys/devices/pci0000:00/0000:00:1d.7/usb3/3-2/bcdDevice
[2884700.274147] CPU 1
[2884700.274215] Modules linked in: sg sr_mod cdrom nls_utf8 cifs sha256_generic aes_x86_64 aes_generic cbc openafs(P) fuse dm_crypt it87 hwmon_vid eeprom loop snd_pcm i915 snd_timer drm_kms_helper snd soundcore snd_page_alloc pcspkr evdev i2c_i801 drm i2c_algo_bit i2c_core video output button processor ext4 mbcache jbd2 crc16 dm_mod raid456 md_mod async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx sd_mod crc_t10dif ata_generic usb_storage mpt2sas pata_jmicron scsi_transport_sas e1000e thermal thermal_sys ahci libata scsi_mod uhci_hcd ehci_hcd usbcore nls_base [last unloaded: scsi_wait_scan]
[2884700.275567] Pid: 529, comm: md2_raid6 Tainted: P           2.6.32-5-amd64 #1 965G-DS3
[2884700.275567] RIP: 0010:[<ffffffffa018387f>]  [<ffffffffa018387f>] handle_stripe+0x89d/0x1785 [raid456]
[2884700.275567] RSP: 0018:ffff880129f39ce0  EFLAGS: 00010297
[2884700.275567] RAX: 0000000000000004 RBX: ffff880129cfac58 RCX: 0000000000000001
[2884700.275567] RDX: 0000000000000006 RSI: 0000000000000003 RDI: 0000000000000004
[2884700.275567] RBP: ffff880129cfac98 R08: ffff88012bb7abc0 R09: ffff88012b982a00
[2884700.275567] R10: 0000000000000286 R11: ffff88012c23ec00 R12: ffff88012b982a00
[2884700.275567] R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000040003
[2884700.275567] FS:  0000000000000000(0000) GS:ffff880005480000(0000) knlGS:0000000000000000
[2884700.275567] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[2884700.275567] CR2: 00007f37dc9865a4 CR3: 000000012e25d000 CR4: 00000000000006e0
[2884700.275567] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[2884700.275567] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[2884700.275567] Process md2_raid6 (pid: 529, threadinfo ffff880129f38000, task ffff88012b9af810)
[2884700.275567] Stack:
[2884700.275567]  0000000081041b51 ffff880005495780 0000000000000001 0004880005495780
[2884700.275567] <0> 0000000000000001 0000000000000000 0000000000000003 0000000100000000
[2884700.275567] <0> 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[2884700.275567] Call Trace:
[2884700.275567]  [<ffffffffa0184b0c>] ? raid5d+0x3a5/0x3ee [raid456]
[2884700.275567]  [<ffffffff812fa9d4>] ? schedule_timeout+0x2e/0xdd
[2884700.275567]  [<ffffffffa016883f>] ? md_thread+0xf1/0x10f [md_mod]
[2884700.275567]  [<ffffffff81064cea>] ? autoremove_wake_function+0x0/0x2e
[2884700.275567]  [<ffffffffa016874e>] ? md_thread+0x0/0x10f [md_mod]
[2884700.275567]  [<ffffffff81064a1d>] ? kthread+0x79/0x81
[2884700.275567]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
[2884700.275567]  [<ffffffff810649a4>] ? kthread+0x0/0x81
[2884700.275567]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20
[2884700.275567] Code: e9 ab 01 00 00 83 7c 24 3c 02 74 04 0f 0b eb fe f6 43 40 10 c7 43 58 00 00 00 00 0f 85 8f 01 00 00 41 8d 45 ff 39 44 24 30 7d 04 <0f> 0b eb fe 83 7c 24 3c 02 75 20 48 63 44 24 6c ff 44 24 2c 48
[2884700.275567] RIP  [<ffffffffa018387f>] handle_stripe+0x89d/0x1785 [raid456]
[2884700.275567]  RSP <ffff880129f39ce0>
[2884700.279877] ---[ end trace 15d85388be6ad5e1 ]---
[2884701.688833] md: md2: data-check done.
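As a sanity check, the Read(10) CDBs in the excerpt above can be decoded by hand to confirm which LBAs the aborted reads targeted (bytes 2-5 are the big-endian LBA, bytes 7-8 the transfer length, per the standard 10-byte CDB layout). A minimal sketch; the function name is illustrative:

```python
def decode_read10(cdb_hex):
    """Decode a SCSI Read(10) CDB into (LBA, transfer length in blocks)."""
    b = [int(x, 16) for x in cdb_hex.split()]
    assert b[0] == 0x28, "not a Read(10) opcode"
    lba = (b[2] << 24) | (b[3] << 16) | (b[4] << 8) | b[5]  # bytes 2-5
    length = (b[7] << 8) | b[8]                             # bytes 7-8
    return lba, length

# First failing CDB from the log above:
lba, n = decode_read10("28 00 40 0e 11 00 00 04 00 00")
print(lba, n)  # 1074663680 1024
```

The decoded LBA 1074663680 matches the sector reported by end_request for dev sdt, so the kernel's error reporting is internally consistent here.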
