Bug 16081 - Data loss after crash during heavy I/O
Summary: Data loss after crash during heavy I/O
Status: RESOLVED UNREPRODUCIBLE
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: All Linux
Importance: P1 high
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-05-31 15:19 UTC by lkolbe
Modified: 2011-02-28 01:24 UTC
CC List: 3 users

See Also:
Kernel Version: 2.6.32.12 (Debian)
Subsystem:
Regression: No
Bisected commit-id:


Attachments
end of trace (205.07 KB, image/jpeg), 2010-05-31 15:19 UTC, lkolbe
lsscsi of host (2.30 KB, text/plain), 2010-05-31 15:21 UTC, lkolbe
lspci -vvv of host (40.23 KB, text/plain), 2010-05-31 15:22 UTC, lkolbe
oops after writing 1TB (32.62 KB, text/plain), 2010-06-02 15:57 UTC, lkolbe
root-filesystem borkage (5.77 KB, text/plain), 2010-06-02 16:44 UTC, lkolbe

Description lkolbe 2010-05-31 15:19:35 UTC
Created attachment 26590 [details]
end of trace 

The machine is a Supermicro X7DWN+ (Intel 5400 chipset, Xeon E5420, 8GB RAM) with an Adaptec 52445 RAID controller and an LSI SAS1068E controller. We have two 9TB ext4 filesystems on LVM on a 20TB RAID50 spanning 24 disks, used as a disk pool for bacula. After writing about 10TB of data (8.5TB to the first, 1.5TB to the second fs), the machine crashed hard (screenshot attached). Afterwards, both filesystems were bonkers (even after e2fsck 1.41.9 ran over them):

shepherd:~# mount /dev/data/badp1 /mnt/
mount: wrong fs type, bad option, bad superblock on /dev/mapper/data-badp1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

shepherd:~# dmesg | tail
[ 8720.688682] EXT4-fs (dm-1): ext4_check_descriptors: Checksum for group 1 failed (49189!=48621)
[ 8720.688708] EXT4-fs (dm-1): group descriptors corrupted!
[14726.691071] EXT4-fs (dm-1): ext4_check_descriptors: Checksum for group 1 failed (49189!=48621)
[14726.691097] EXT4-fs (dm-1): group descriptors corrupted!
[14737.262709] EXT4-fs (dm-2): mounted filesystem with ordered data mode
[15315.441515] EXT4-fs (dm-1): ext4_check_descriptors: Checksum for group 1 failed (49189!=48621)
[15315.441540] EXT4-fs (dm-1): group descriptors corrupted!

shepherd:~# mount /dev/data/badp2 /mnt/
shepherd:~# ls -la /mnt/
total 80
drwxr-xr-x   3 root root  4096 2010-05-31 13:10 .
drwxr-xr-x  23 root root  4096 2010-05-31 13:01 ..
drwx------ 250 root root 69632 2010-05-31 13:10 lost+found
shepherd:~# ls -la /mnt/lost+found/ | head -n 20
total 216936
drwx------ 250 root       root          69632 2010-05-31 13:10 .
drwxr-xr-x   3 root       root           4096 2010-05-31 13:10 ..
c----wxr--   1  774037444  162299347 237, 210 1957-02-23 13:50 #1000
brwx-----T   1 1954511736 3121970260 249, 121 1922-08-12 15:08 #10021
b-w---xrwt   1  543753214 3130053982 234, 213 2012-06-01 07:58 #10027
c--S--sr-T   1 3871079531 3443641576   2, 232 2036-01-31 13:12 #10036
-r-S-w-r-T   1 2298731406  344458386    32768 2035-05-22 08:46 #10046
brw---Srw-   1 2052225653 4012639896 218, 196 1912-06-23 18:14 #10067
prwS-wSr-x   1 2235883341 1302567651        0 1927-10-10 00:51 #10086
s-wS--x-wt   1 2286828425 2999490124        0 1949-08-22 22:50 #10109
crw--wSrwt   1 3083778288 3882824206 148, 212 2003-07-28 08:32 #10126
s-wS--sr-x   1  874900871   80451928        0 1977-11-28 01:52 #10130
s--sr-x---   1 1903432768    1059722        0 2013-07-05 00:55 #10131
c-w-r-Sr-T   1 3259732952 2590389953   9,  22 2012-06-19 14:56 #10147
pr-x-w--wt   1 1627318825 1016384218        0 1956-12-27 06:01 #10160
srw-r-SrwT   1 2603486838 3240878817        0 1954-11-16 08:43 #10177
srw---srwt   1  458009213  951782573        0 2023-12-03 18:43 #10184
brwxr--rwx   1 2423698452 2252742920  44, 231 1956-07-25 07:28 #10197
brwS-wS-w-   1 3480615060 1244965598  44, 189 2006-10-21 17:03 #1020

This is the second or third time the machine crashed after writing ca. 10TB of data, but the first time we see this kind of data corruption.

Any hints on how to debug/reproduce such a thing? For the moment, we are keeping the broken filesystem for further analysis (if that's necessary), but sadly this is our primary backup disk pool and we need to have it running again rather soon ...
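
(For reference, a minimal sketch of reproducing this kind of write load without a full bacula run; the target path and size here are only illustrative:)

# Write ~1TiB of sequential data to one of the affected filesystems,
# bypassing the page cache so the load stays write-heavy.
dd if=/dev/zero of=/var/bacula/diskpool/fs2/fill.bin bs=1M count=$((1024*1024)) oflag=direct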
Comment 1 lkolbe 2010-05-31 15:21:24 UTC
Created attachment 26591 [details]
lsscsi of host
Comment 2 lkolbe 2010-05-31 15:22:26 UTC
Created attachment 26592 [details]
lspci -vvv of host
Comment 3 lkolbe 2010-05-31 15:24:17 UTC
One thing I forgot: with Supermicro's current BIOS 1.2b, the machine exhibits machine check exceptions that Linux attributes to the hardware. With their BIOS 1.1b, they do not happen.
Comment 4 Eric Sandeen 2010-06-01 16:38:04 UTC
Getting the whole original oops would be great (since it seems like you can reproduce it)...

Did e2fsck find anything?  (e2fsck -f?)

Were filesystem barriers left on, and does the storage support them?
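
(For reference, a minimal sketch of checking both questions, using the device and mount paths from this report; the -n run is read-only:)

# Force a full check without modifying anything (-n answers "no" to all prompts);
# drop -n to actually repair, and keep the log for the bug.
e2fsck -fn /dev/data/badp1 2>&1 | tee badp1-fsck.log

# Mount options actually in effect; ext4 on 2.6.32 uses barriers by default
# unless "barrier=0" / "nobarrier" shows up here.
grep diskpool /proc/mounts

# Any outright barrier failure normally leaves a note in the kernel log.
dmesg | grep -i barrier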
Comment 5 lkolbe 2010-06-02 12:02:18 UTC
e2fsck segfaulted after about an hour on the first volume and had gazillions of questions for the second.

I don't know about barriers, mount says:
/dev/mapper/data-badp2 on /var/bacula/diskpool/fs2 type ext4 (rw,nosuid,nodev)

Whether the Adaptec 52445 supports barriers, I really don't know.
The serial console is now finally working, so if this happens again we'll get a full stack trace. It will most likely take a few days of backup runs to trigger, though. Thanks for looking into this!
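
(For what it's worth, barriers only matter if there is a volatile write cache in the path. A quick, non-destructive way to see what the LUNs advertise, assuming sdparm/hdparm are installed, and keeping in mind that behind a hardware RAID controller the answer reflects the controller's LUN rather than the individual disks:)

sdparm --get=WCE /dev/sdb    # WCE=1 means a volatile write cache is enabled on that LUN
hdparm -W /dev/sda           # ATA equivalent; may not work through the Adaptec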
Comment 6 Eric Sandeen 2010-06-02 12:10:00 UTC
If a barrier failed outright you'd see a note in dmesg / the logs shortly after the mount, FWIW. In any case, you didn't explicitly disable them. Barrier support in dm/lvm is rather new, as well. Just a thought...

Capturing a core from the segfaulted e2fsck would help fix -that- bug ... and attaching the output of the fscks might yield a clue as to what is damaged.

-Eric
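
(A minimal sketch of capturing that core, assuming the segfault reproduces; the device path is the one from this report, and -n keeps the run read-only:)

ulimit -c unlimited            # allow core dumps in this shell
e2fsck -fn /dev/data/badp1     # reproduce the segfault; core file name/location depends on core_pattern
gdb /sbin/e2fsck core          # then "bt" for a backtrace to attach to the e2fsck bug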
Comment 7 lkolbe 2010-06-02 15:57:20 UTC
Created attachment 26618 [details]
oops after writing 1TB 

This happened rather sooner than expected, after writing about 1TB of data to the two ext4 filesystems at approx. 150MB/sec.

Hopefully this trace means something?
Comment 8 lkolbe 2010-06-02 16:44:04 UTC
Created attachment 26619 [details]
root-filesystem borkage

Hm. After a reset, both 9TB filesystems were normal (journal replayed). But a few minutes after the boot, we got really strange errors (see attachment) and could only resurrect the root filesystem with a live CD and its fsck, as grub wouldn't detect a filesystem anymore. fsck fixed it, though (broken superblock and some minor fixes). The system is booting as I write this, and I'll continue the same backup tests, but this time without barriers on both filesystems.
Comment 9 Eric Sandeen 2010-06-02 17:53:52 UTC
(In reply to comment #8)
> ... The system boots as I write this, and I'll continue the
> same backup-tests but this time without barriers on both filesystems.

no... turning barriers -off- certainly won't help anything.

Whenever I see bad metadata corruption post-crash-and-reset I worry about missing barriers.  My mention of them was only to see whether they are properly in use, as they should be on any storage w/ a volatile write cache.
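
(For reference, a sketch of making barrier use explicit rather than implicit; the option names are the standard ext3/ext4 mount options, and the device/mount point names follow the ones shown elsewhere in this report:)

# ext4 already defaults to barriers on 2.6.32, but it can be spelled out:
mount -o barrier=1 /dev/data/badp2 /var/bacula/diskpool/fs2

# ext3 (the root filesystem here) does NOT enable barriers by default;
# a hypothetical fstab line for the root LV would look like:
#   /dev/mapper/system-root  /  ext3  errors=remount-ro,barrier=1  0  1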
Comment 10 Eric Sandeen 2010-06-02 18:06:54 UTC
(In reply to comment #8)
> Created an attachment (id=26619) [details]
> root-filesystem borkage

How confident are you in your storage?

> [  765.812082] attempt to access beyond end of device
> [  765.812088] dm-6: rw=256, want=18808645176, limit=8388608

the "want" value (in sectors) is ~9T.

The limit is, oddly (?), 2^23 = 8388608; that many sectors comes out to exactly 4T.

IOW, now your block device appears to be much smaller than your filesystem....
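
(Working those numbers through with 512-byte sectors; as comment 14 below notes, the limit actually comes out to 4G, not 4T:)

echo $((18808645176 * 512))    # 9630026330112 bytes, i.e. ~8.76 TiB, an offset out near the 9T filesystems
echo $((8388608 * 512))        # 4294967296 bytes = exactly 4 GiB, the size of the dm-6 root LV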
Comment 11 Eric Sandeen 2010-06-02 18:24:29 UTC
As for the trace (attachment #26618 [details]) it looks like we've found a page w/o buffers.

There seems to be rather a lot going wrong with this machine; I'm having a hard time getting a feel for what the root cause might be...
Comment 12 Eric Sandeen 2010-06-02 18:28:32 UTC
I don't know if it's at all possible, but testing on block devices and filesystems just smaller than 8T would be an interesting data point, if that turns out to succeed... we really should be perfectly safe at 9T, but this is looking like maybe a write has wrapped somewhere and corrupted things.

A resident dm expert also requested the output of "dmsetup table" for the machine that yielded the "access beyond end of device" message.
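
(A sketch of such a test, assuming space can first be freed in the "data" VG; the LV name is made up, the rest follows the naming used in this report:)

lvcreate -L 7.5T -n testp1 data     # just under the 8TiB mark
mkfs.ext4 /dev/data/testp1
mount /dev/data/testp1 /mnt
dmsetup table > dmsetup-table.txt   # the output the dm folks asked for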
Comment 13 lkolbe 2010-06-02 21:56:56 UTC
Funny thing is, dm-6 is the root filesystem, and it's only 4GB in size. It lives on a VG consisting of one 100GB RAID-50 over 24 disks. Some relevant data:

shepherd:~# lvm lvs -a -o+devices
  LV          VG     Attr   LSize   Origin Snap%  Move Log Copy%  Convert Devices          
  badp1       data   -wi-ao   9.00T                                       /dev/sdb(25600)  
  badp2       data   -wi-ao   9.00T                                       /dev/sdb(2384896)
  baspool     data   -wi-ao   1.00T                                       /dev/sdb(4769792)
  bawork      data   -wi-ao 100.00G                                       /dev/sdb(0)      
  db1_srv     data   -wi-ao 100.00G                                       /dev/sdb(4744192)
  dir1_bawork data   -wi-ao 100.00G                                       /dev/sdb(5031936)
  db1_log     system -wi-ao   4.00G                                       /dev/sda1(7168)  
  db1_root    system -wi-ao   4.00G                                       /dev/sda1(6144)  
  db1_swap    system -wi-ao   4.00G                                       /dev/sda1(8192)  
  dir1_log    system -wi-ao   4.00G                                       /dev/sda1(4096)  
  dir1_root   system -wi-ao   4.00G                                       /dev/sda1(3072)  
  dir1_swap   system -wi-ao   4.00G                                       /dev/sda1(5120)  
  log         system -wi-ao   4.00G                                       /dev/sda1(1024)  
  root        system -wi-ao   4.00G                                       /dev/sda1(0)     
  swap        system -wi-ao   4.00G                                       /dev/sda1(2048)  

The requested dmsetup table:
shepherd:~# dmsetup table
data-dir1_bawork: 0 209715200 linear 8:16 41221620096
system-db1_log: 0 8388608 linear 8:1 58720640
system-db1_swap: 0 8388608 linear 8:1 67109248
system-db1_root: 0 8388608 linear 8:1 50332032
data-bawork: 0 209715200 linear 8:16 384
data-db1_srv: 0 209715200 linear 8:16 38864421248
data-baspool: 0 2147483648 linear 8:16 39074136448
system-dir1_swap: 0 8388608 linear 8:1 41943424
system-dir1_root: 0 8388608 linear 8:1 25166208
data-badp2: 0 19327352832 linear 8:16 19537068416
data-badp1: 0 19327352832 linear 8:16 209715584
system-swap: 0 8388608 linear 8:1 16777600
system-root: 0 8388608 linear 8:1 384
system-dir1_log: 0 8388608 linear 8:1 33554816
system-log: 0 8388608 linear 8:1 8388992

Adaptec version numbers: BIOS, firmware, and boot flash are all at build 17899.
aacraid driver: 2461 (the version shipped with 2.6.32)

I have no reason (yet) to distrust our storage - it's one 100GB RAID-50 and one ~19TB RAID-50 on 24 Hitachi HDE721010SLA330 drives with firmware ST6OA3AA, if that means anything to anyone.

Since the last crash bacula has written 3.2TiB to data-badp1 and it's still running (when all backups are done, it should have written ~12TiB). We'll see if it survives tomorrow.

If it crashes again, I'll try 8TiB-Filesystems.

Thanks for taking your time!
Lukas
Comment 14 Eric Sandeen 2010-06-02 22:07:52 UTC
(In reply to comment #13)
> Funny thing is, dm-6 is the root-filesystem, and it's 4GB big. 

Whoops, you're right, I missed a unit there :(  The reported limit was indeed 4G, not 4T.  Still, why was it trying to read a block out at 9T ... ?  Hmmm.
Comment 15 Eric Sandeen 2010-06-02 22:09:55 UTC
Just realized all the root fs errors were on ext3, too - not ext4.  This gives me more reason to be worried about things outside the filesystem itself, I'm afraid.
Comment 16 lkolbe 2010-06-03 06:02:28 UTC
Thanks, I'll do another round of memtest then. Do you have any idea what else to look for/what to test?
Comment 17 Eric Sandeen 2010-06-03 14:19:42 UTC
I'd just review the storage configuration as well, I guess, though not sure of any specifics to look for.
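
(A few generic hardware sanity checks that fit here, all hedged on the relevant tools being installed and on the Adaptec passing drive data through; none of them modify anything:)

memtester 1024M 3                   # userspace RAM test; a full memtest86+ boot is more thorough
smartctl -a /dev/sda                # drive/LUN health, if the controller exposes it
arcconf getconfig 1                 # Adaptec array/controller/BBU status, if arcconf is installed
dmesg | grep -iE 'mce|machine check|aacraid'   # controller or MCE complaints in the kernel log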
Comment 18 lkolbe 2010-06-05 14:32:46 UTC
Thanks, though. After working flawlessly for more than 13TiB, we hit another crash today - a colleague ran 'lsscsi', and after that all commands quit with 'Bus error' for a while and the machine hung with no messages on the serial line. Before that, 'cat /proc/interrupts' still worked and showed a massive ERR count:

shepherd:/etc# cat /proc/interrupts 
             CPU0       CPU1       CPU2       CPU3       
[...]
 THR:          0          0          0          0   Threshold APIC interrupts
 MCE:          0          0          0          0   Machine check exceptions
 MCP:         28         28         28         28   Machine check polls
 ERR:   37567046
 MIS:          0

I suppose this means it's not Linux's fault?
Comment 19 Theodore Tso 2011-02-28 01:24:43 UTC
Closing this bug as it looks pretty clear it was caused by hardware problems.
