Created attachment 26590 [details]
end of trace

On a Supermicro X7DWN+ (Intel 5400 chipset, Xeon E5420, 8GB RAM) with an Adaptec 52445 RAID controller and an LSI SAS1068E controller, we have two 9TB ext4 filesystems on LVM on a 20TB RAID-50 spanning 24 disks, used as a disk pool for bacula.

After writing about 10TB of data (8.5TB to the first filesystem, 1.5TB to the second), the machine crashed hard (screenshot attached). Afterwards, both filesystems were bonkers (after e2fsck 1.41.9 had run over them):

shepherd:~# mount /dev/data/badp1 /mnt/
mount: wrong fs type, bad option, bad superblock on /dev/mapper/data-badp1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail or so

shepherd:~# dmesg | tail
[ 8720.688682] EXT4-fs (dm-1): ext4_check_descriptors: Checksum for group 1 failed (49189!=48621)
[ 8720.688708] EXT4-fs (dm-1): group descriptors corrupted!
[14726.691071] EXT4-fs (dm-1): ext4_check_descriptors: Checksum for group 1 failed (49189!=48621)
[14726.691097] EXT4-fs (dm-1): group descriptors corrupted!
[14737.262709] EXT4-fs (dm-2): mounted filesystem with ordered data mode
[15315.441515] EXT4-fs (dm-1): ext4_check_descriptors: Checksum for group 1 failed (49189!=48621)
[15315.441540] EXT4-fs (dm-1): group descriptors corrupted!

shepherd:~# mount /dev/data/badp2 /mnt/
shepherd:~# ls -la /mnt/
total 80
drwxr-xr-x   3 root root  4096 2010-05-31 13:10 .
drwxr-xr-x  23 root root  4096 2010-05-31 13:01 ..
drwx------ 250 root root 69632 2010-05-31 13:10 lost+found

shepherd:~# ls -la /mnt/lost+found/ | head -n 20
total 216936
drwx------ 250       root       root    69632 2010-05-31 13:10 .
drwxr-xr-x   3       root       root     4096 2010-05-31 13:10 ..
c----wxr--   1  774037444  162299347 237, 210 1957-02-23 13:50 #1000
brwx-----T   1 1954511736 3121970260 249, 121 1922-08-12 15:08 #10021
b-w---xrwt   1  543753214 3130053982 234, 213 2012-06-01 07:58 #10027
c--S--sr-T   1 3871079531 3443641576   2, 232 2036-01-31 13:12 #10036
-r-S-w-r-T   1 2298731406  344458386    32768 2035-05-22 08:46 #10046
brw---Srw-   1 2052225653 4012639896 218, 196 1912-06-23 18:14 #10067
prwS-wSr-x   1 2235883341 1302567651        0 1927-10-10 00:51 #10086
s-wS--x-wt   1 2286828425 2999490124        0 1949-08-22 22:50 #10109
crw--wSrwt   1 3083778288 3882824206 148, 212 2003-07-28 08:32 #10126
s-wS--sr-x   1  874900871   80451928        0 1977-11-28 01:52 #10130
s--sr-x---   1 1903432768    1059722        0 2013-07-05 00:55 #10131
c-w-r-Sr-T   1 3259732952 2590389953    9, 22 2012-06-19 14:56 #10147
pr-x-w--wt   1 1627318825 1016384218        0 1956-12-27 06:01 #10160
srw-r-SrwT   1 2603486838 3240878817        0 1954-11-16 08:43 #10177
srw---srwt   1  458009213  951782573        0 2023-12-03 18:43 #10184
brwxr--rwx   1 2423698452 2252742920  44, 231 1956-07-25 07:28 #10197
brwS-wS-w-   1 3480615060 1244965598  44, 189 2006-10-21 17:03 #1020

This is the second or third time the machine has crashed after writing ca. 10TB of data, but it's the first time we've seen this kind of data corruption. Any hints on how to debug/reproduce such a thing? For the moment we're keeping the broken filesystem for further analysis (if that's necessary), but sadly this is our primary backup disk pool and we need it running again rather soon ...
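For reference, when we poke at the remaining broken filesystem, the plan is roughly this - a sketch only; the superblock/blocksize values are the mke2fs defaults for a 4k-block filesystem and may not match what mkfs actually used here:

# print the backup superblock locations mkfs would have used
# (-n = dry run, touches nothing on the device)
mke2fs -n /dev/data/badp1

# peek at the metadata via the first backup superblock (32768 is
# the usual location with 4k blocks), read-only
dumpe2fs -o superblock=32768 -o blocksize=4096 /dev/data/badp1 | head

# only if that looks sane: fsck against the backup superblock
e2fsck -f -b 32768 -B 4096 /dev/data/badp1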
Created attachment 26591 [details]
lsscsi of host

Created attachment 26592 [details]
lspci -vvv of host
One thing I forgot: Using Supermicro's current BIOS 1.2b, the machine exhibits machine check exceptions that Linux thinks are the hardware's fault. With their BIOS 1.1b, they do not happen.
Getting the whole original oops would be great (since it seems like you can reproduce it)...

Did e2fsck find anything? (e2fsck -f?)

Were filesystem barriers left on, and does the storage support them?
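Checking is cheap - something like this (a sketch; the exact log message wording varies by kernel version):

# /proc/mounts shows the options the filesystems are actually
# mounted with; ext4 defaults to barriers on unless barrier=0 was passed
grep badp /proc/mounts

# if the storage rejects barriers, the journal layer logs a line
# along the lines of "barrier-based sync failed ... disabling barriers"
dmesg | grep -i barrier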
e2fsck segfaulted after about an hour on the first volume and had gazillions of questions for the second. I don't know about barriers; mount says:

/dev/mapper/data-badp2 on /var/bacula/diskpool/fs2 type ext4 (rw,nosuid,nodev)

Whether the Adaptec 52445 supports barriers, I really don't know. The serial console is now finally working, so if this happens again we'll get a full stack trace. It will most likely take a few days of backup runs to trigger, though. Thanks for looking into this!
If a barrier failed outright you'd see a note in dmesg / logs shortly after the mount, FWIW. Anyway, you didn't explicitly disable it, and barrier support in dm/lvm is rather new as well. Just a thought...

Capturing a core from the segfaulted e2fsck would help fix -that- bug ... and attaching the output of the fscks might yield a clue as to what is damaged.

-Eric
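P.S.: Capturing the core could be as simple as this (a sketch; pick whatever path suits you for the cores, and substitute the actual pid in the core filename):

# allow core dumps and give them a predictable name
ulimit -c unlimited
echo '/var/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern

# re-run the check read-only first and keep a full log
e2fsck -fn /dev/data/badp1 2>&1 | tee /var/tmp/fsck-badp1.log

# on a segfault, grab a backtrace from the resulting core
gdb $(which e2fsck) /var/tmp/core.e2fsck.PID -ex bt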
Created attachment 26618 [details]
oops after writing 1TB

This happened rather sooner than expected, after writing about 1TB of data to the two ext4 filesystems at approx. 150MB/sec. Hopefully this trace means something?
Created attachment 26619 [details]
root-filesystem borkage

Hm. After a reset, both 9TB filesystems were normal (journal replayed). But a few minutes after the boot, we got really strange errors (see attachment) and could only resurrect the root filesystem with a live CD and its fsck, as grub wouldn't detect a filesystem anymore. fsck fixed it, though (broken superblock and some minor fixes). The system boots as I write this, and I'll continue the same backup tests, but this time without barriers on both filesystems.
(In reply to comment #8)
> ... The system boots as I write this, and I'll continue the
> same backup tests, but this time without barriers on both filesystems.

No... turning barriers -off- certainly won't help anything. Whenever I see bad metadata corruption post-crash-and-reset, I worry about missing barriers. My mention of them was only to see whether they are properly in use, as they should be on any storage w/ a volatile write cache.
(In reply to comment #8)
> Created an attachment (id=26619) [details]
> root-filesystem borkage

How confident are you in your storage?

> [  765.812082] attempt to access beyond end of device
> [  765.812088] dm-6: rw=256, want=18808645176, limit=8388608

The "want" value (in sectors) is ~9T. The limit, oddly (?), is 2^23 (8388608); that many sectors comes out to exactly 4T. IOW, your block device now appears to be much smaller than your filesystem....
As for the trace (attachment #26618 [details]), it looks like we've found a page w/o buffers. There seems to be rather a lot going wrong with this machine; I'm having a hard time getting a feel for what the root cause might be...
I don't know if it's at all possible, but testing on block devices and filesystems just smaller than 8T would be an interesting data point, if that yields success... We really should be perfectly safe at 9T, but this is looking like maybe a write has wrapped somewhere and corrupted things (a rough sketch of such a test follows below).

A resident dm expert also requested the output of "dmsetup table" for the machine that yielded the "access beyond end of device" message.
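For the <8T experiment, something like this would do (a sketch; the LV name is made up, and it assumes enough free space in the data VG):

# carve out a test filesystem just under the suspected 8T boundary
lvcreate -L 7.9T -n test8t data
mkfs.ext4 /dev/data/test8t
mount /dev/data/test8t /mnt/test

# then throw the same bacula-style write load at it and see whether
# the corruption still reproduces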
Funny thing is, dm-6 is the root filesystem, and it's only 4GB in size. It lives on a VG consisting of one 100GB RAID-50 over 24 disks. Some relevant data:

shepherd:~# lvm lvs -a -o+devices
  LV          VG     Attr   LSize   Origin Snap%  Move Log Copy%  Convert Devices
  badp1       data   -wi-ao   9.00T                                       /dev/sdb(25600)
  badp2       data   -wi-ao   9.00T                                       /dev/sdb(2384896)
  baspool     data   -wi-ao   1.00T                                       /dev/sdb(4769792)
  bawork      data   -wi-ao 100.00G                                       /dev/sdb(0)
  db1_srv     data   -wi-ao 100.00G                                       /dev/sdb(4744192)
  dir1_bawork data   -wi-ao 100.00G                                       /dev/sdb(5031936)
  db1_log     system -wi-ao   4.00G                                       /dev/sda1(7168)
  db1_root    system -wi-ao   4.00G                                       /dev/sda1(6144)
  db1_swap    system -wi-ao   4.00G                                       /dev/sda1(8192)
  dir1_log    system -wi-ao   4.00G                                       /dev/sda1(4096)
  dir1_root   system -wi-ao   4.00G                                       /dev/sda1(3072)
  dir1_swap   system -wi-ao   4.00G                                       /dev/sda1(5120)
  log         system -wi-ao   4.00G                                       /dev/sda1(1024)
  root        system -wi-ao   4.00G                                       /dev/sda1(0)
  swap        system -wi-ao   4.00G                                       /dev/sda1(2048)

The requested dmsetup table:

shepherd:~# dmsetup table
data-dir1_bawork: 0 209715200 linear 8:16 41221620096
system-db1_log: 0 8388608 linear 8:1 58720640
system-db1_swap: 0 8388608 linear 8:1 67109248
system-db1_root: 0 8388608 linear 8:1 50332032
data-bawork: 0 209715200 linear 8:16 384
data-db1_srv: 0 209715200 linear 8:16 38864421248
data-baspool: 0 2147483648 linear 8:16 39074136448
system-dir1_swap: 0 8388608 linear 8:1 41943424
system-dir1_root: 0 8388608 linear 8:1 25166208
data-badp2: 0 19327352832 linear 8:16 19537068416
data-badp1: 0 19327352832 linear 8:16 209715584
system-swap: 0 8388608 linear 8:1 16777600
system-root: 0 8388608 linear 8:1 384
system-dir1_log: 0 8388608 linear 8:1 33554816
system-log: 0 8388608 linear 8:1 8388992

Adaptec version numbers: BIOS, firmware and boot flash are build 17899; the aacraid driver is build 2461 (the version shipped with 2.6.32).

I have (yet) no reason not to trust our storage - it's one 100GB RAID-50 and one ~19TB RAID-50 on 24 Hitachi HDE721010SLA330 drives with firmware ST6OA3AA, if that means anything to anyone.

Since the last crash, bacula has written 3.2TiB to data-badp1 and it's still running (when all backups are done, it should have written ~12TiB). We'll see if it survives tomorrow. If it crashes again, I'll try 8TiB filesystems.

Thanks for taking the time!

Lukas
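P.S.: If I read the dmsetup table format right (each line is: start length target device offset, all in 512-byte sectors), the sizes check out - quick sanity arithmetic:

# data-badp1: 19327352832 sectors * 512 bytes = exactly 9 TiB
echo $((19327352832 * 512 / 1024**4))   # -> 9

# system-root (dm-6): 8388608 sectors * 512 bytes = 4 GiB
echo $((8388608 * 512 / 1024**3))       # -> 4

# the kernel's current idea of each device's size, in sectors
blockdev --getsz /dev/mapper/data-badp1
blockdev --getsz /dev/mapper/system-root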
(In reply to comment #13)
> Funny thing is, dm-6 is the root filesystem, and it's only 4GB in size.

Whoops, you're right - I missed a unit there :( The reported limit was indeed 4G, not 4T.

Still, why was it trying to read a block out at 9T...? Hmmm.
Just realized all the root fs errors were on ext3, too - not ext4. This gives me more reason to be worried about things outside the filesystem itself, I'm afraid.
Thanks, I'll do another round of memtest then. Do you have any idea what else to look for/what to test?
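Besides another offline memtest86+ pass, I'll also hammer the RAM from userspace while the box is up - a sketch (memtester needs to be installed, and the size has to leave headroom for the running system):

# lock 6 of the 8GB and run 3 loops of test patterns over it
memtester 6G 3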
I'd just review the storage configuration as well, I guess, though I'm not sure of any specific things to look for.
Thanks, though. After working flawlessly for more than 13TiB, we hit another crash today: a colleague ran 'lsscsi'; after that, all commands quit with 'Bus error' for a while, and the machine hung with no messages on the serial line. Before that, cat /proc/interrupts worked and showed a massive ERR count:

shepherd:/etc# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
[...]
THR:          0          0          0          0   Threshold APIC interrupts
MCE:          0          0          0          0   Machine check exceptions
MCP:         28         28         28         28   Machine check polls
ERR:   37567046
MIS:          0

I suppose this means it's not Linux's fault?
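Until the hardware question is settled, we're watching the error counters from userspace - a sketch (mcelog needs to be installed):

# the ERR line counts APIC error interrupts; watch whether it keeps growing
watch -n 10 'grep -E "(ERR|MCE|MCP):" /proc/interrupts'

# decode any machine check records the kernel has logged
mcelog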
Closing this bug as it looks pretty clear it was caused by hardware problems.