Bug 43791 - kernel 3.4.3 + e2fsprogs 1.42 + hdparm-9.39 : Raid-1 : complete data loss
Status: RESOLVED UNREPRODUCIBLE
Alias: None
Product: Other
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: other_other
URL: https://bugs.gentoo.org/show_bug.cgi?...
Reported: 2012-06-25 20:08 UTC by Manfred
Modified: 2015-02-19 16:46 UTC

Kernel Version: 3.4.3
Regression: No

Description Manfred 2012-06-25 20:08:19 UTC
--------------
Short version:
--------------

Needing more space on my (quicker) HW-Raid-10,
I wanted to transfer ~850 GiB to a (slower) SW-Raid1.

Creation succeeded, transfer succeeded,
compare (diff) succeeded,
even reboot succeeded;
but after Power Off / Boot: all data lost,
filesystem only offering an empty "lost+found".

I'm _not_ really sure (yet) whether this strange effect
is to be considered a kernel problem
or a RAID problem,
but thought it might make sense to give a warning.

----- cite ----- [PM]

... there is no evidence of a problem with RAID.  The filesystem has
lost its contents.  So it *looks* like an error with "rm" or "mkfs".  It
possibly isn't that simple but it doesn't look at all like a RAID problem.

NeilBrown

----- /cite -----
[@Neil: Thank you for looking at it and for your comment]

I am continuing to work on additional experiments,
but unfortunately they take soo loong ...

I am prepared to provide additional information as requested;
suggestions on how to debug this strange effect are welcome.
Comment 1 Manfred 2012-06-25 20:12:39 UTC
-------------
Long version:
-------------

Hardware involved:

AMD Phenom(tm) 9950 Quad-Core
8 GiB RAM
ASUS M2N-SLI Deluxe

Source: HW-Raid-10:
# lspci -s 02:00.0 -v
02:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09)
        Subsystem: Adaptec ASR-2405

Destination: SW-Raid-1:
hdparm -i /dev/sdb
.   Model=ST31500341AS, FwRev=CC1H, ...
hdparm -i /dev/sdc
.   Model=ST31500341AS, FwRev=CC1H, ...

These two are attached to an Adaptec 1220SA:
# lspci -s 03:00.0 -v
03:00.0 RAID bus controller: Silicon Image, Inc. Device 0242 (rev 01)
        Subsystem: Adaptec Device 0242

Kernel:
I had been running 3.2.16. Having noticed that
. - the problem with the radix-tree iterators had been fixed in 3.4.2 and
. - Neil Brown's RAID fix had arrived in [3.4, 3.3.4, or 3.2.17],
I upgraded the kernel to 3.4.3 first.

To be cautious, I deleted the old Raid-1:
.  ddrescue -f /dev/zero /dev/sdb -b 4096
.  ddrescue -f /dev/zero /dev/sdc -b 4096
.  < after Reboot: no md any more >

confirmed TLER settings:
.  smartctl -l scterc,70,70 /dev/sdb
.  smartctl -l scterc,70,70 /dev/sdc

and built it anew:
mdadm --create --verbose --metadata=1.2 /dev/md/ST-21 --level=mirror --raid-devices=2 /dev/sdb /dev/sdc

$ equery belongs mdadm
. sys-fs/mdadm-3.1.5 (/sbin/mdadm)

On purpose, I gave md some hours to complete syncing from /dev/sdb to /dev/sdc
before even starting to partition:
.  parted  -a optimal  /dev/md/ST-21
.        mklabel msdos
.        mkpart primary ext2 4096 -1

and creating the filesystem:
.  mkfs.ext4  -L ST-21-P1  -E lazy_itable_init=0,lazy_journal_init=0  /dev/md/ST-21p1
( -E : to make mkfs initialize the inode tables and the journal completely, so that no pending operations were left open )

$ equery belongs mkfs.ext4
. sys-fs/e2fsprogs-1.42 (/sbin/mkfs.ext4)

Nota bene:
. "E2fsprogs 1.42 (November 29, 2011)
.  This release of e2fsprogs has support for file systems > 16TB."
and:
. "E2fsprogs 1.42.4 (June 12, 2012)
.  Fixed more 64-bit block number bugs (which could end up corrupting file systems!) in e2fsck, debugfs, and libext2fs."

/etc/fstab:
.  LABEL=ST-21-P1  /Mammut/ST-21-P1 ext4   defaults,noatime 1 2

df -h :
...
/dev/md127p1    1,4T     21G  1,3T    2% /Mammut/ST-21-P1
...

Because this was data I did not need permanent access to,
the Seagate drives were configured to spin down after 10 minutes without access:

equery list hdparm:
[IP-] [  ] sys-apps/hdparm-9.39:0

/etc/conf.d/hdparm:
...
sdb_args="-S120"
sdc_args="-S120"
...
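
As a cross-check of the 10-minute figure, here is a minimal sketch of how a -S value maps to a standby timeout, following the encoding described in the hdparm(8) man page. The function name hdparm_standby_seconds is hypothetical and only serves this illustration; it does not touch any drive.

```shell
#!/bin/sh
# Decode an "hdparm -S <value>" argument into a standby timeout in seconds.
# Hypothetical helper for illustration; encoding as documented in hdparm(8).
hdparm_standby_seconds() {
    v=$1
    if [ "$v" -eq 0 ]; then
        echo "disabled"                     # 0 turns the standby timer off
    elif [ "$v" -ge 1 ] && [ "$v" -le 240 ]; then
        echo $(( v * 5 ))                   # 1..240: multiples of 5 seconds
    elif [ "$v" -ge 241 ] && [ "$v" -le 251 ]; then
        echo $(( (v - 240) * 30 * 60 ))     # 241..251: 1..11 units of 30 minutes
    elif [ "$v" -eq 252 ]; then
        echo $(( 21 * 60 ))                 # 252: 21 minutes
    elif [ "$v" -eq 255 ]; then
        echo $(( 21 * 60 + 15 ))            # 255: 21 minutes plus 15 seconds
    else
        echo "vendor-defined or reserved"   # 253 and 254
    fi
}

hdparm_standby_seconds 120   # prints 600, i.e. 10 minutes
```

So -S120 as configured here corresponds to 120 x 5 s = 600 s.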

Now I copied the respective directory tree T:
.  cp -a  /<Raid-10-mountpoint>/T  /Mammut/ST-21-P1/

and verified the result with
.  diff -r  /<Raid-10-mountpoint>/T  /Mammut/ST-21-P1/T
which reported no differences.
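
A comparison like the one above needs the source tree to be present. To be able to re-verify the copy later (for example after a reboot or a power cycle), one option is a checksum manifest. The following is only a sketch under assumptions: it relies on GNU find, sort, xargs and sha256sum, and the function names make_manifest and check_manifest are hypothetical. It was not part of the original procedure.

```shell
#!/bin/sh
# Write a SHA-256 manifest of all regular files below a tree.
# $1 = tree root, $2 = manifest file to create
make_manifest() {
    ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$2"
}

# Re-check a tree against a manifest; returns non-zero if any file
# is missing or differs. --quiet prints only the failures.
# $1 = tree root, $2 = manifest file (absolute path, because of the cd)
check_manifest() {
    ( cd "$1" && sha256sum --quiet -c "$2" )
}

# Example usage (paths are placeholders):
#   make_manifest  /Mammut/ST-21-P1/T  /root/T.sha256
#   ... reboot or power cycle ...
#   check_manifest /Mammut/ST-21-P1/T  /root/T.sha256
```

The manifest should be stored on a different filesystem than the copy itself, so that it survives if the copy disappears.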

I'm sorry that I have to be a little imprecise now:
As far as I remember,
there was a reboot first, with the copy still readable,
then an automatic spin-down.
After another reboot at some stage,
the copy was not visible while the disks were in standby mode;
after spinning up the two disks, it was visible again.

Anyway:
Complete power-off overnight,
power-on the next morning:
the copied T was _gone_     !!!,
leaving only an (empty) "lost+found" ???


What I get now is the following:

# mdadm -Evvvvs
mdadm: No md superblock detected on /dev/md/mammut:ST-21p1.
mdadm: No md superblock detected on /dev/md/mammut:ST-21.
...
/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 16bd66f7:96a400f6:eb91f3c0:f5e58122
           Name : mammut:ST-21  (local to host mammut)
  Creation Time : Wed Jun 20 19:51:50 2012
     Raid Level : raid1
   Raid Devices : 2
 Avail Dev Size : 2930275120 (1397.26 GiB 1500.30 GB)
     Array Size : 2930274848 (1397.26 GiB 1500.30 GB)
  Used Dev Size : 2930274848 (1397.26 GiB 1500.30 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : a4cef825:a19980d2:285560d9:0c6da2af
    Update Time : Fri Jun 22 07:30:30 2012
       Checksum : e9b7551c - correct
         Events : 19
   Device Role : Active device 1
   Array State : AA ('A' == active, '.' == missing)
...
/dev/sdb:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 16bd66f7:96a400f6:eb91f3c0:f5e58122
           Name : mammut:ST-21  (local to host mammut)
  Creation Time : Wed Jun 20 19:51:50 2012
     Raid Level : raid1
   Raid Devices : 2
 Avail Dev Size : 2930275120 (1397.26 GiB 1500.30 GB)
     Array Size : 2930274848 (1397.26 GiB 1500.30 GB)
  Used Dev Size : 2930274848 (1397.26 GiB 1500.30 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : baa75e0c:424e949e:b15d863d:a5e31ef8
    Update Time : Fri Jun 22 07:30:30 2012
       Checksum : 48d361f4 - correct
         Events : 19
   Device Role : Active device 0
   Array State : AA ('A' == active, '.' == missing)
...
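
In the output above, both members report the same Array UUID and the same event count (19), which fits the assessment that the RAID layer itself looks consistent. A minimal sketch for doing this comparison mechanically on saved "mdadm -E" output follows; the function names md_field_values and md_members_agree are hypothetical, and the parsing assumes the "Label : value" layout shown above.

```shell
#!/bin/sh
# Print the distinct values of one "Label : value" field found in
# mdadm --examine text read from stdin.
# $1 = field label, e.g. "Events" or "Array UUID"
md_field_values() {
    awk -F' : ' -v key="$1" '
        { k = $1; gsub(/^ +| +$/, "", k) }                        # trim the label
        k == key { v = $2; gsub(/^ +| +$/, "", v); print v }      # trim and print the value
    ' | sort -u
}

# Check that all members in a saved dump share one Array UUID and one
# event count. $1 = file containing the saved "mdadm -E" output.
md_members_agree() {
    for key in "Array UUID" "Events"; do
        n=$(md_field_values "$key" < "$1" | wc -l | tr -d ' ')
        [ "$n" -eq 1 ] || { echo "members differ in: $key"; return 1; }
    done
    echo "members agree"
}

# Example usage (file name is a placeholder):
#   mdadm -E /dev/sdb /dev/sdc > md-examine.txt
#   md_members_agree md-examine.txt
```

Differing event counts would point to one member having missed writes; identical values, as here, do not.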

# ls -algR /Mammut/ST-21-P1/

/Mammut/ST-21-P1/:
total 24
drwxr-xr-x 3 root  4096 21. Jun 19:06 .
drwxr-xr-x 5 root  4096 21. May 21:37 ..
drwx------ 2 root 16384 20. Jun 19:58 lost+found

/Mammut/ST-21-P1/lost+found:
total 20
drwx------ 2 root 16384 20. Jun 19:58 .
drwxr-xr-x 3 root  4096 21. Jun 19:06 ..

!----------------------------------!
! No  /Mammut/ST-21-P1/T  any more !
!----------------------------------!
Comment 2 Manfred 2012-06-25 20:23:35 UTC
Short version + link submitted to

. . . linux-raid@vger.kernel.org

[ 2012-06-25__22:21 ]
Comment 3 Manfred 2012-10-15 16:36:00 UTC
This box has been in production use and thus had to be upgraded
(today, it's vanilla-3.6.2 etc.).

Although I tried hard over a couple of weekends,
I could not reproduce the effect with later versions of the named SW components.

Thus I would not mind closing this bug as "NEEDINFO"
(or whatever might be more appropriate).

Thanks.
