Bug 14354

Summary: Bad corruption with 2.6.32-rc1 and upwards
Product: File System Reporter: Holger Freyther (zecke)
Component: ext4    Assignee: fs_ext4 (fs_ext4)
Status: CLOSED CODE_FIX    
Severity: normal CC: adilger.kernelbugzilla, agk, aneesh.kumar, brian, bug-track, chris.mason, david, dmbtech, fredrik, gmazyland, kai.kasurinen, makovick, mathias.buren, maximlevitsky, nemesis, ricwheeler, rjw, sandeen, torvalds, tytso, ua_bugzilla_linux-kernel
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.32-rc1 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 14230    
Attachments: fsck log
E2croncheck shell script
jbd2 debugging patch
Boot + mounting FS ro
Log message on restart after journal abort
fsck.log with blocks assigned to multiple inodes.
dmesg host=zwerg 2.6.32-rc2
dmesg host=zwerg 2.6.31
dmesg host=mini v2.6.31
2: reboot + mounting ro after journal replay
2: fsck on issue detected in attachment 23398
dmesg 2.6.32-rc4 host=zwerg
fsck 2.6.32-rc4 host=zwerg
kernel log
fsck log
logs of all bisected kernels
skip the non-journal branch if readonly
Testing patch; ro-diff with a WARN_ON(1)
kernel config
disasm with patch
disasm without patch
Alexey's x86_64 config.
fsck log
fsck with journal corruption
console log w/ aneesh's patch
console log of 2.6.32-rc3 corruption
Horrible horrible debugging hack

Description Holger Freyther 2009-10-09 15:42:16 UTC
I don't know how to reproduce it, but since the "BUG_ON with page buffers" fix I'm seeing the following issues with 2.6.32-rcX on a daily basis.

- Clean shutdown
- Reboot -> fs turns into ro mode
- Reboot -> fsck. I see blocks assigned to two inodes, and many files touched shortly before or during the shutdown end up in lost+found. This requires going through the 1B, 1C, and 1D passes of fsck.


I'm not sure how to reproduce this or how to report it properly so that it is of use to anyone. I'm seeing ext4 corruption on a daily basis, though.
Comment 1 Eric Sandeen 2009-10-09 15:51:36 UTC
(In reply to comment #0)
> I don't know how to reproduce it, but since the "BUG_ON with page buffers"
> fix I'm seeing the following issues with 2.6.32-rcX on a daily basis.
> 
> - Clean shutdown
> - Reboot -> fs turns into ro mode

When it does this, you should get something in dmesg; can you attach that?

> - Reboot -> fsck. I see blocks assigned to two inodes, and many files touched
> shortly before or during the shutdown end up in lost+found. This requires
> going through the 1B, 1C, and 1D passes of fsck.

please attach the actual e2fsck output.

Thanks,
-Eric
Comment 2 Holger Freyther 2009-10-09 16:06:46 UTC
(In reply to comment #1)
> (In reply to comment #0)
> > I don't know how to reproduce it, but since the "BUG_ON with page buffers"
> > fix I'm seeing the following issues with 2.6.32-rcX on a daily basis.
> > 
> > - Clean shutdown
> > - Reboot -> fs turns into ro mode
> 
> When it does this, you should get something in dmesg; can you attach that?

I will try; it is a bit difficult, as the distro is not booting up to a login prompt in this case. I will try hard.

> 
> > - Reboot -> fsck. I see blocks assigned to two inodes, and many files
> > touched shortly before or during the shutdown end up in lost+found. This
> > requires going through the 1B, 1C, and 1D passes of fsck.
> 
> please attach the actual e2fsck output.

Do you have a better idea than using fsck -y / > /boot/fsck.output for keeping the log?
Comment 3 Eric Sandeen 2009-10-09 16:44:47 UTC
(In reply to comment #2)

> > When it does this, you should get something in dmesg; can you attach that?
> 
> I will try; it is a bit difficult, as the distro is not booting up to a login
> prompt in this case. I will try hard.

if you see it on the screen just a general idea or even a photo is fine :)  (may need to switch consoles)
 
> > please attach the actual e2fsck output.
> 
> Do you have a better idea than using fsck -y / > /boot/fsck.output for
> keeping the log?

hopefully that works fine :)

Thanks,
-eric
Comment 4 Theodore Tso 2009-10-09 16:50:35 UTC
Can you also attach the output of dmesg, so we can see what kind of device you are using for your root filesystem, and what sort of boot messages are being emitted by the kernel?   This could very well be a device driver problem.

Also, what distribution are you using?
Comment 5 Oleksij Rempel (fishor) 2009-10-10 07:32:53 UTC
I have the same issue on three of my systems. This is not a hardware issue. I will try to reproduce it on a virtual system. Any tips on how to make that easier?
Comment 6 Oleksij Rempel (fishor) 2009-10-10 16:48:58 UTC
OK, I think I now know how it happens. I needed some time to play around to reproduce it. There is _no_difference_ if you just boot a 2.6.32 kernel, run fsck, and then boot a 2.6.31 kernel and run fsck again - nothing will happen.

Here is the way I reproduced it:
1. crash (reset or poweroff) a 2.6.32-rc3 kernel,
2. start again in 2.6.32-rc3,
3. run fsck; it looks clean, but some programs will lose their settings.

4. reboot and start with a 2.6.31 kernel
5. run fsck and this will find that the ext4 filesystem is broken.
6. after fsck fixes it, many files will be lost.
Comment 7 Oleksij Rempel (fishor) 2009-10-10 16:50:59 UTC
The question is: why does fsck think the partition is clean under 2.6.32, while a forced fsck on 2.6.31 will "fix" it even though new files were on the broken blocks?
Comment 8 Oleksij Rempel (fishor) 2009-10-10 17:00:08 UTC
I use e2fsprogs version 1.41.9-1ubuntu1.
Comment 9 Oleksij Rempel (fishor) 2009-10-10 17:04:23 UTC
Created attachment 23333 [details]
fsck log

This is the log I get after the steps described in my previous message. This time I lost just some settings for compiz, evolution and miro ... Thank goodness I make backups.
Comment 10 Theodore Tso 2009-10-10 19:54:58 UTC
>1. crash (reset or poweroff) a 2.6.32-rc3 kernel,
>2. start again in 2.6.32-rc3,
>3. run fsck; it looks clean, but some programs will lose their settings.

Are you running fsck with the -f option, or not?   

>4. reboot and start with 2.6.31 kernel

This is a clean shutdown or a crash?

> 5. run fsck and this will find that the ext4 filesystem is broken.

Again, is this an fsck -f (forced fsck), or just a normal fsck?  What arguments are you using to fsck each time?
Comment 11 Theodore Tso 2009-10-11 01:26:15 UTC
One more question; is this a completely stock 2.6.32-rcX kernel, or is this one with special patches from Ubuntu?  If so, can you give me a pointer to the Ubuntu git tree and the commit ID that was used?
Comment 12 Theodore Tso 2009-10-11 02:03:12 UTC
When the file system gets remounted read-only, can you please send the output of the "dmesg" command?   If you don't have an extra file system where you can save the dmesg output, please do "dmesg | grep -i ext4" and copy down what you see.  Thanks!!
Comment 13 Oleksij Rempel (fishor) 2009-10-11 12:31:21 UTC
Now I tried to make this test as clean as possible.

1. start the PC and after some time reset it or power it off
2. after restart, boot the kernel with the option break=mount; it will not mount the root fs, so you stay in the initrd instead
3. mount the root partition manually in read-only mode
4. chroot
5. fsck -y /dev/root >> fsck.log
6. fsck -y -f /dev/root >> fsck.log

On both kernels, after a crash, "fsck -y" returns:
=============================================================
fsck from util-linux-ng 2.16
/dev/mapper/zwerg_buntu-root_one: clean, 266498/1220608 files, 2444502/4882432 blocks

dmesg on mount says this (looks like both kernels act the same way):
=============================================================
[   32.797149] EXT3-fs: dm-0: couldn't mount because of unsupported optional features (240).
[   32.808599] EXT4-fs (dm-0): INFO: recovery required on readonly filesystem
[   32.809755] EXT4-fs (dm-0): write access will be enabled during recovery
[   32.823038] EXT4-fs (dm-0): barriers enabled
[   33.014768] kjournald2 starting: pid 1166, dev dm-0:8, commit interval 5 seconds
[   33.014792] EXT4-fs (dm-0): delayed allocation enabled
[   33.014794] EXT4-fs: file extents enabled
[   33.014937] EXT4-fs: mballoc enabled
[   33.014954] EXT4-fs (dm-0): orphan cleanup on readonly fs
[   33.014958] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 131262
[   33.014994] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 131238
[   33.015004] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 131164
[   33.015014] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 131161
[   33.015023] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 131137
[   33.015032] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 131119
[   33.015041] EXT4-fs (dm-0): 6 orphan inodes deleted
[   33.015042] EXT4-fs (dm-0): recovery complete
[   33.397102] EXT4-fs (dm-0): mounted filesystem with ordered data mode

A forced fsck ("fsck -y -f ...") on the 2.6.31 kernel:
===========================================================
/dev/mapper/zwerg_buntu-root_one: clean, 266499/1220608 files, 2443934/4882432 blocks
fsck from util-linux-ng 2.16
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/mapper/zwerg_buntu-root_one: 266499/1220608 files (0.7% non-contiguous), 2443934/4882432 blocks

on 2.6.32-rcX:
===========================================================
/dev/mapper/zwerg_buntu-root_one: clean, 266474/1220608 files, 2444555/4882432 blocks
fsck from util-linux-ng 2.16
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry 'AB08-118C' in /media (32769) has deleted/unused inode 123461.  Clear? yes

Entry 'mtab' in /var/lib/DeviceKit-disks (57430) has deleted/unused inode 393.  Clear? yes

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Unattached inode 407
Connect to /lost+found? yes

Inode 407 ref count is 2, should be 1.  Fix? yes

Inode 32769 ref count is 7, should be 6.  Fix? yes

Pass 5: Checking group summary information
Block bitmap differences:  -1606658 -1606772
Fix? yes

Free blocks count wrong for group #0 (9591, counted=9592).
Fix? yes

Free blocks count wrong (2437877, counted=2437878).
Fix? yes

Free inodes count wrong for group #15 (6623, counted=6624).
Fix? yes

Directories count wrong for group #15 (791, counted=790).
Fix? yes

Free inodes count wrong (954134, counted=954135).
Fix? yes


/dev/mapper/zwerg_buntu-root_one: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/zwerg_buntu-root_one: ***** REBOOT LINUX *****
/dev/mapper/zwerg_buntu-root_one: 266473/1220608 files (0.7% non-contiguous), 2444554/4882432 blocks

I didn't manage to reproduce it on kvm-qemu.
Comment 14 Theodore Tso 2009-10-11 19:07:10 UTC
Holger, could you try running the same test and see if you get similar results?

Alexey, when you say that the dmesg was the same for both kernels, was it _identical_?   Was it always the same number of orphaned inodes which were deleted, and were the inode numbers always the same?  And you're sure you didn't see anything like:

EXT4-fs (dm-0): error: ext4_fill_super: journal transaction 4641626 is corrupt
EXT4-fs (dm-0): Mounting filesystem read-only

In Holger's report, he mentions that after the first reboot, the file system turns read-only.   Holger, what do you see in your system logs after each reboot, in particular before the filesystem gets mounted or remounted read-only?
Comment 15 Oleksij Rempel (fishor) 2009-10-11 21:45:34 UTC
Yes, I had the read-only problem too, but I can't reproduce it now. I had some trouble mounting the system and fscking it, and some important files were corrupt. At that point all I wanted was to make it work again.. so there is no dmesg from it.
On my main PC I used debsums to find the broken files and reinstalled them. On my laptop I didn't have so much luck; the package database was corrupted, so I reformatted and reinstalled the system. I can try to go back to 2.6.32-rcX and use it until the next crash, but I need to know what to do to extract all the information needed for this bug. A first good idea will probably be a remote syslog server.
Comment 16 Theodore Tso 2009-10-11 23:01:06 UTC
I've been trying to recreate this failure mode, and haven't had any luck.   It looks like you are using LVM, so a really good thing to do is to use the e2croncheck script right after you reboot and login to the system.  If it reports a clean filesystem, no worries.  If it doesn't, then it would really be good to snapshot the output of dmesg to /tmp (I assume you have /tmp mounted as tmpfs) and e-mail it someplace safe, and e2croncheck can also be configured to e-mail the e2fsck output someplace safe as well.

I'll note that what Holger and Alexey are seeing is somewhat different.  Holger seems to be seeing a problem after a clean shutdown after re-installing glibc from a build directory.  That would imply orphaned inode handling getting screwed up somehow, but I haven't been able to reproduce this on my systems.   Alexey is seeing a problem after a crash/power failure, which implies some sort of issue with journal replay.  One change that we did make between 2.6.31 and 2.6.32 is that we enable journal checksums by default.  In theory if the hard drive is ignoring barriers, or is otherwise screwing up the journal I/O, we could get a journal checksum error that would abort the journal playback.  That in theory could explain Alexey's symptoms, but he didn't report a journal checksum error message.

So I really don't know what's going on right now.
Comment 17 Theodore Tso 2009-10-12 00:02:59 UTC
Created attachment 23350 [details]
E2croncheck shell script
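
For reference, the core of an e2croncheck-style check is roughly the sketch below; the volume group, logical volume, and e-mail address are hypothetical placeholders, and the actual attachment may differ in detail:

  #!/bin/sh
  # Snapshot the logical volume and fsck the snapshot, so the mounted
  # original is never touched; mail the log if e2fsck reports problems.
  VG=myvg                  # hypothetical volume group
  LV=root                  # hypothetical logical volume
  EMAIL=me@example.com     # hypothetical recipient

  lvcreate -s -L 1G -n "$LV-snap" "/dev/$VG/$LV"
  if ! e2fsck -f -y "/dev/$VG/$LV-snap" > "/tmp/fsck.$LV.log" 2>&1; then
      mail -s "e2fsck errors on $VG/$LV" "$EMAIL" < "/tmp/fsck.$LV.log"
  fi
  lvremove -f "/dev/$VG/$LV-snap"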
Comment 18 Theodore Tso 2009-10-12 02:18:41 UTC
Created attachment 23353 [details]
jbd2 debugging patch

The jbd2-checksum-debugging patch adds two options which can be added to the boot line.   The first, jbd2.xsum_debug=1, will print two additional kernel messages when replaying a journal that will help debug checksums.   The second, jbd2.ignore_xsum=1, will cause jbd2 to ignore any checksum errors.

If you can come up with a repeatable test case, it would be useful to apply this patch and then add "jbd2.xsum_debug=1 jbd2.ignore_xsum=1" to the boot line.   Then let me know (a) if the problem goes away when you add the boot command-line options, and (b) give me the dmesg output either way.
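
For example, with GRUB the kernel line might look like this (kernel version and root device are hypothetical):

  kernel /boot/vmlinuz-2.6.32-rc4 root=/dev/mapper/myvg-root ro jbd2.xsum_debug=1 jbd2.ignore_xsum=1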

Thanks!
Comment 19 Holger Freyther 2009-10-12 02:24:09 UTC
(In reply to comment #16)

> I'll note that what Holger and Alexey are seeing is somewhat different.
> Holger seems to be seeing a problem after a clean shutdown after
> re-installing glibc from a build directory.

I have been trying to reproduce it for the last two days but didn't really
succeed. I'm not sure if it still makes sense to upload a dmesg/config to
describe my system. Basically I'm running Ubuntu Karmic (one thing in common)
with a self-built kernel (plain kernel.org) and I'm using a crypto root (so he
and I both have the device mapper in use in some form).

I will try more to reproduce this. Thanks for taking a look.
Comment 20 Holger Freyther 2009-10-12 07:54:31 UTC
Created attachment 23356 [details]
Boot + mounting FS ro

This was reached after getting two crashes/screen blanks on 2.6.32-rc3 (something bad with KMS and the intel driver) and two forced resets. The filesystem got mounted read only during make -j 8 (using icecc/icecream to distribute compile jobs) in my WebKit build tree and a git pull --rebase in the linux-2.6 tree.
Comment 21 Holger Freyther 2009-10-12 07:56:08 UTC
Created attachment 23357 [details]
Log message on restart after journal abort

That is from the reboot (using an older 2.6.32-rc2 kernel due to intel driver issues on the console). The kernel is from shortly after the BUG_ON fix on the page buffers (is that enough information?).
Comment 22 Holger Freyther 2009-10-12 07:57:58 UTC
Created attachment 23358 [details]
fsck.log with blocks assigned to multiple inodes.

So the fs assigned the same recent blocks both to my WebKit build and to the git pull --rebase in the kernel tree.
Comment 23 Eric Sandeen 2009-10-12 15:24:10 UTC
When you do these "forced resets," are they power cycles?
Comment 24 Aneesh Kumar K.V 2009-10-12 15:27:13 UTC
This one was found on 2.6.30-1, reported by one of my friends:

Hi Aneesh,

I got the second ext4 fs error on my wife's eeepc.  I have recorded what
was fixed.  The kernel is 2.6.30-1 from debian sid.

suvaipc:~# fsck.ext4 /dev/mapper/eeepc-suvai 
e2fsck 1.41.3 (12-Oct-2008)
/dev/mapper/eeepc-suvai contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 336893 has illegal block(s).  Clear<y>? yes

Illegal block #12 (7602287) in inode 336893.  CLEARED.
Illegal block #13 (6422560) in inode 336893.  CLEARED.
Illegal block #15 (7274595) in inode 336893.  CLEARED.
Illegal block #16 (7536750) in inode 336893.  CLEARED.
Illegal block #17 (6553705) in inode 336893.  CLEARED.
Illegal block #18 (7471205) in inode 336893.  CLEARED.
Illegal block #19 (6553701) in inode 336893.  CLEARED.
Illegal block #20 (6357024) in inode 336893.  CLEARED.
Illegal block #22 (7602281) in inode 336893.  CLEARED.
Illegal block #23 (7798816) in inode 336893.  CLEARED.
Illegal block #24 (7536737) in inode 336893.  CLEARED.
Too many illegal blocks in inode 336893.
Clear inode<y>? yes

Restarting e2fsck from the beginning...
/dev/mapper/eeepc-suvai contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes

Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks
Multiply-claimed block(s) in inode 336889: 1347590 1347591 1347592 1347593 1347595 1347597
Multiply-claimed block(s) in inode 336890: 1898439 1898442 1898443 1898444 1898445 1898446 1898513 1898514 1898515 1898516 1898517 1898518 1898519 1898520 1898521 1898522 1898543 1898544 1898545
Multiply-claimed block(s) in inode 336891: 1361951 1361952 1361961 1361962 1361963 1361964 1361965 1361966 1361967 1361968 1361969 1361970 1361971 1361972 1361973 1361974 1361975 1361976 1361977 1361978
Multiply-claimed block(s) in inode 336892: 1415316 1415317 1415318 1415319 1415320 1415321 1415344 1415345
Multiply-claimed block(s) in inode 336894: 1415359 1415382 1416689
Multiply-claimed block(s) in inode 336895: 1898381 1898382 1898383 1898384 1898385 1898386 1898387 1898388 1898389 1898390 1898391 1898392 1898393 1898394 1898395 1898396 1898397 1898398 1898399 1898400 1898401 1898402 1898403 1898404 1898405 1898406 1898407 1898408 1898409 1898410 1898411 1898412 1898413 1898414 1898415 1898416 1898417 1898418 1898419 1898420 1898421 1898422 1898423 1898424 1898425 1898426 1898427 1898428 1898429 1898430 1898431 1898432 1898433 1898434 1898435 1898436 1898437 1898438
Multiply-claimed block(s) in inode 336896: 1361922 1361923 1361924 1361925 1361926 1361927 1361929 1361930 1361931 1361932 1361933 1361934 1361935 1361947 1361948 1361949 1361950 1361979 1361980
Multiply-claimed block(s) in inode 336897: 1347590 1347591 1347592 1347593 1347595 1347597
Multiply-claimed block(s) in inode 336898: 1898439 1898442 1898443 1898444 1898445 1898446 1898513 1898514 1898515 1898516 1898517 1898518 1898519 1898520 1898521 1898522 1898543 1898544 1898545
Multiply-claimed block(s) in inode 336899: 1361951 1361952 1361961 1361962 1361963 1361964 1361965 1361966 1361967 1361968 1361969 1361970 1361971 1361972 1361973 1361974 1361975 1361976 1361977 1361978
Multiply-claimed block(s) in inode 336900: 1415316 1415317 1415318 1415319 1415320 1415321 1415344 1415345
Multiply-claimed block(s) in inode 336902: 1415359 1415382 1416689
Multiply-claimed block(s) in inode 336903: 1898381 1898382 1898383 1898384 1898385 1898386 1898387 1898388 1898389 1898390 1898391 1898392 1898393 1898394 1898395 1898396 1898397 1898398 1898399 1898400 1898401 1898402 1898403 1898404 1898405 1898406 1898407 1898408 1898409 1898410 1898411 1898412 1898413 1898414 1898415 1898416 1898417 1898418 1898419 1898420 1898421 1898422 1898423 1898424 1898425 1898426 1898427 1898428 1898429 1898430 1898431 1898432 1898433 1898434 1898435 1898436 1898437 1898438
Multiply-claimed block(s) in inode 336904: 1361922 1361923 1361924 1361925 1361926 1361927 1361929 1361930 1361931 1361932 1361933 1361934 1361935 1361947 1361948 1361949 1361950 1361979 1361980
Pass 1C: Scanning directories for inodes with multiply-claimed blocks
Pass 1D: Reconciling multiply-claimed blocks
(There are 14 inodes containing multiply-claimed blocks.)

......
........

Pass 2: Checking directory structure
Entry '#336913' in /lost+found (11) has deleted/unused inode 336913.  Clear<y>? yes

Entry '#336914' in /lost+found (11) has deleted/unused inode 336914.  Clear<y>? yes

Entry '#336915' in /lost+found (11) has deleted/unused inode 336915.  Clear<y>? yes

Entry '#336916' in /lost+found (11) has deleted/unused inode 336916.  Clear<y>? yes

Entry '#336917' in /lost+found (11) has deleted/unused inode 336917.  Clear<y>? yes

Entry '#336918' in /lost+found (11) has deleted/unused inode 336918.  Clear<y>? yes

Entry '#336919' in /lost+found (11) has deleted/unused inode 336919.  Clear<y>? yes

Entry '#336920' in /lost+found (11) has deleted/unused inode 336920.  Clear<y>? yes


Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Inode 336558 ref count is 12, should be 11.  Fix<y>? yes

Unattached inode 336898
Connect to /lost+found<y>? yes

Inode 336898 ref count is 2, should be 1.  Fix<y>? yes

Unattached inode 336899
Connect to /lost+found<y>? yes

Inode 336899 ref count is 2, should be 1.  Fix<y>? yes

Unattached inode 336902
Connect to /lost+found<y>? yes

Inode 336902 ref count is 2, should be 1.  Fix<y>? yes

Unattached inode 336904
Connect to /lost+found<y>? yes

Inode 336904 ref count is 2, should be 1.  Fix<y>? yes

Pass 5: Checking group summary information
Block bitmap differences:  -(1344019--1344024) -(1347586--1347589) -1347594 -1347596 -(1347598--1347599) -(1349632--1349638) -1353729 -(1415305--1415308) -(1415329--1415332) -(1415338--1415343) -(1902878--1902967)
Fix<y>? yes

Free blocks count wrong for group #0 (31076, counted=30947).
Fix<y>? yes

Free blocks count wrong for group #41 (19489, counted=19511).
Fix<y>? yes

Free blocks count wrong for group #43 (1, counted=15).
Fix<y>? yes

Free blocks count wrong for group #58 (13532, counted=13622).
Fix<y>? yes

Free blocks count wrong (2373548, counted=2373545).
Fix<y>? yes

Inode bitmap differences:  -336893 -(336913--336920)
Fix<y>? yes

Free inodes count wrong for group #41 (7031, counted=7040).
Fix<y>? yes

Directories count wrong for group #41 (254, counted=253).
Fix<y>? yes

Free inodes count wrong (1302827, counted=1302836).
Fix<y>? yes


/dev/mapper/eeepc-suvai: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/eeepc-suvai: 7884/1310720 files (3.9% non-contiguous), 2869335/5242880 blocks
suvaipc:~# fsck.ext4 /dev/mapper/eeepc-suvai 
e2fsck 1.41.3 (12-Oct-2008)
/dev/mapper/eeepc-suvai: clean, 7884/1310720 files, 2869335/5242880 blocks
suvaipc:~# exit

Script done on Mon Sep 28 11:17:59 2009
Comment 25 Theodore Tso 2009-10-12 18:15:15 UTC
Hi, I'm trying to track down a problem relating to ext4, in
kernel bugzilla 14354.  Originally it was thought this was a
2.6.31->2.6.32-rc1 regression, but Aneesh Kumar's report shows a
very similar fsck transcript using a 2.6.30-1 kernel (which makes
it unclear whether this is really a regression or not), and all three
reports are tied together by the use of device-mapper.  

The fsck logs are consistent with barriers being disabled.  As I recall under
some circumstances device mapper silently drops barrier requests --- I
thought this was fixed for device mapper in some cases.  What's the
current status of barrier requests in device mapper?  Are they
faithfully passed for standard (non-RAID) LVM2 volumes?  When was this
changed?   What about dm-crypto volumes?  (One of the reports was with a
dm-crypt volume from what I can tell.)

Thanks,

						- Ted
Comment 26 Oleksij Rempel (fishor) 2009-10-12 18:29:57 UTC
I'm not really sure about LVM; I do not use LVM on my laptop, but I had the corruption problem on it too. The only bad thing that happened there was an ext4 oops during a bisect, but the corruption happened about one week after that oops.
I use LVM on my PC, but I had a corrupted sector on the home partition too (home is not on LVM).
Comment 27 Theodore Tso 2009-10-12 21:41:16 UTC
Alexey, do you have anything approaching a reliable reproduction case?  What you stated in comment #6 implied that you could reliably reproduce this.   If you do, can you give me details about the hardware and kernel versions that were used, and the dmesg outputs, and whether LVM and/or device-mapper was involved?
Comment 28 Rafael J. Wysocki 2009-10-12 21:50:24 UTC
*** Bug 14370 has been marked as a duplicate of this bug. ***
Comment 29 Holger Freyther 2009-10-13 00:18:55 UTC
(In reply to comment #23)
> When you do these "forced resets," are they power cycles?

Yes. By forced resets I mean: I press the power button of my notebook, wait for
the device to shut off, and then power it on again.
Comment 30 Oleksij Rempel (fishor) 2009-10-13 08:00:18 UTC
Created attachment 23376 [details]
dmesg host=zwerg 2.6.32-rc2

The PC I used to reproduce this issue has only LVM. It has run almost all versions between 2.6.31 and 2.6.32-rc3; mostly I am testing intel graphics issues.
Comment 31 Oleksij Rempel (fishor) 2009-10-13 08:02:12 UTC
Created attachment 23377 [details]
dmesg host=zwerg 2.6.31
Comment 32 Oleksij Rempel (fishor) 2009-10-13 08:47:44 UTC
Created attachment 23378 [details]
dmesg host=mini v2.6.31

This is my laptop. I do not use any "dm" here (no raid, lvm or crypt), and here I had the same issue. I used most versions between 2.6.31 and 2.6.32-rc2 (bisection). After the reinstall there is only the 2.6.31 kernel.
Comment 33 Theodore Tso 2009-10-13 13:16:01 UTC
Alexey, the dmesg outputs you are posting --- are these random dmesg's, or ones taken right before the system goes read-only, or the boot session *before* that?   If you can easily trigger the file system corruption, something really useful to do would be to set up syslog to do remote logging, so we can get the EXT4 syslog just as or after the root filesystem gets remounted read-only.   In addition, it would be useful to get the EXT4 syslog messages for one or two boot sessions *before* the file system was found to be corrupted, as well as the boot session where the file system was found corrupted.  
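
(Remote logging can be set up with a one-line change on the machine under test; the log host name here is hypothetical, and on rsyslog-based systems the file is /etc/rsyslog.conf instead:)

  # forward all syslog messages to a remote log host, then restart syslogd
  echo '*.*    @loghost.example.com' >> /etc/syslog.conf
  # the log host's syslogd must accept remote messages (classic sysklogd
  # needs to be started with -r)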

If the file system is getting corrupted after the journal is found corrupted due to a checksum mismatch, the syslog file may not have the EXT4 syslog messages, since the file system may end up getting remounted read-only.  And in that case, it's useful to see if there are any EXT4 syslog messages or any messages from the device driver showing any I/O errors.

The reason why I'm asking about LVM errors and dm-crypt is that at least one fsck log had the name /dev/mapper/sda5_crypt.   At this point I'm still trying to find common elements and information leading to a reliable reproduction --- and something that can confirm that this is indeed a regression, so I know whether or not focusing on recent commits is productive.  (The theory that it might have something to do with journal checksums is assuming that this is a regression, but so far none of the dmesg is showing the "Journal transaction XXXX is corrupt" that would be expected if the theory that this was caused by a journal checksum failure is correct.   So I don't want to focus too soon on recent commits as potential causes of this problem until we know for certain it is a regression.)
Comment 34 Mike Snitzer 2009-10-13 14:50:58 UTC
On Mon, Oct 12 2009 at  2:15pm -0400,
Theodore Ts'o <tytso@mit.edu> wrote:

> Hi, I'm trying to track down a problem relating to ext4, in
> kernel bugzilla 14354.  Originally it was thought this was a
> 2.6.31->2.6.32-rc1 regression, but Aneesh Kumar's report shows a
> very similar fsck transcript using a 2.6.30-1 kernel (which makes
> it unclear whether this is really a regression or not), and all three
> reports are tied together by the use of device-mapper.  
> 
> The fsck logs are consistent with barriers being disabled.  As I recall under
> some circumstances device mapper silently drops barrier requests --- I
> thought this was fixed for device mapper in some cases.  What's the
> current status of barrier requests in device mapper?  Are they
> faithfully passed for standard (non-RAID) LVM2 volumes?  When was this
> changed?   What about dm-crypto volumes?  (One of the reports was with a
> dm-crypt volume from what I can tell.)

Barrier infrastructure started to get added to the DM core in 2.6.30, see:
http://git.kernel.org/linus/af7e466a1ace

But barriers were not enabled for all DM targets (_except_ dm-multipath)
until 2.6.31.  So 2.6.31's dm-crypt does support barriers, see:
http://git.kernel.org/linus/647c7db14ef9

If the underlying device(s) support barriers DM should faithfully pass
them on (again except for dm-multipath).  Also, requests with barriers
that result in -EOPNOTSUPP are retried without the barrier, see:
http://git.kernel.org/linus/51aa32284958

Mike
Comment 35 Alasdair G Kergon 2009-10-13 15:12:04 UTC
We used to issue EOPNOTSUPP in response to barriers (so flushing ceased to be supported when it became barrier-based). 'Basic' barrier support was added first (2.6.30-rc2), as Mike says, by waiting for relevant I/O to complete.  Then this was extended (2.6.31-rc1) to send barriers to the underlying devices for most types of dm targets.

To see which dm targets in a particular source tree forward barriers run:
  grep 'ti->num_flush_requests =' drivers/md/dm*c
(set to a non-zero value).
Comment 36 Ric Wheeler 2009-10-13 20:17:11 UTC
I am not sure that "waiting" will do what it needs to do. To do barriers properly, we need to control the volatile cache state of the target device which requires waiting and then sending an explicit cache flush.

Sounds like barriers should be working properly in 2.6.31 though.
Comment 37 Eric Sandeen 2009-10-13 20:28:45 UTC
If there is any sort of eventual reproducer, we could test the barrier theory by disabling write caches on the constituent drive(s) with hdparm.
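
(For example, assuming the drive is /dev/sda:)

  hdparm -W 0 /dev/sda    # turn off the drive's volatile write cache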

-Eric
Comment 38 Oleksij Rempel (fishor) 2009-10-13 20:55:21 UTC
@Theodore, I can reproduce the data-loss bug, not the read-only bug; it happened mostly with 2.6.32-rcX.

I had some problems with a read-only root, but that seems to be a different bug in Karmic's upstart. See:
https://bugs.launchpad.net/ubuntu/+source/upstart/+bug/434358

In my data-loss issue, if something goes wrong and the kernel oopses or freezes (and I need to power off the system), the _possibility_ of losing something seems much bigger if I use a 2.6.32-rcX kernel. For example, I got corrupted files which had been written one week before the system really broke.
Comment 39 Eric Sandeen 2009-10-13 21:10:00 UTC
Alexey, when you say corrupted files, what do you mean?  Are the files missing?  Full of 0s?  Random data in part or all of the file?

As for data loss, -some- data loss on a power fail is not unexpected; buffered data which has not been synced -will- be lost, so we need to be careful about how we use these terms I think ...

-Eric
Comment 40 Theodore Tso 2009-10-13 22:02:31 UTC
Alexey, in addition to Eric's very good comments, the other thing that's worth checking is to do a "sudo touch /forcefsck" (which will force an fsck on the next reboot) before shutting down your system -- or do it early in your login session, before you do something dangerous that might require a forced reboot/power cycle of your system.

We need to distinguish between a file which was being replaced at the time of the power failure (and some desktop applications/toolkits are very update-happy in terms of constantly rewriting application dot files) and the file system actually becoming corrupt.
Comment 41 Holger Freyther 2009-10-14 00:45:12 UTC
Created attachment 23398 [details]
2: reboot + mounting ro after journal replay

~2.6.32-rc2 crashed (due to a bug in the intel driver). I rebooted into 2.6.31, but shortly after the journal replay, and after an fsck that said everything was fine, the fs got mounted read-only.
Comment 42 Holger Freyther 2009-10-14 00:48:01 UTC
Created attachment 23400 [details]
2: fsck on issue detected in attachment 23398 [details]

This is the fsck log. Again, independent files use the same block; in this case the .o and other files created during the kernel compile. I will try to prepare my netbook to work on a reproducible test case.
Comment 43 Theodore Tso 2009-10-14 02:31:43 UTC
Hmm.... what were you doing right before the crash?   It looks like you were doing a kernel compile in /home/ich/source/linux/linux-2.6, since there were files with a modtime of Tue Oct 13 16:25:55 2009.   What's funny is that when these files were allocated, they used blocks that were apparently already in use by other object files in that same source directory with a mod-time of Sat Oct 10 13:51:14 2009.   Did you do a "make clean" at any time between Saturday and Tuesday that should have deleted these files?

If so, what I would strongly recommend is to run e2fsck -f on /dev/mapper/sda5_crypt before you mount it, each time.   What seems to be happening is that the block allocation bitmap is getting corrupted somehow.   This is what causes the multiply-claimed blocks.  I'm guessing the file system had gotten corrupted already, before this last boot session.   The trick is catching things *before* the filesystem is so badly corrupted that it gets remounted read-only and fsck has a large number of multiply-claimed inodes to clean up.   This is important for two reasons: (a) it helps us localize when the initial file system corruption is taking place, which helps us find a reproduction case, and (b) it reduces the chances that you'll lose data.

So the problem is that /dev/mapper/sda5_crypt is your root filesystem, and it's using dm-crypt.  *Sigh*, this actually makes life difficult, since last I checked we can't do LVM snapshots of dm-crypt devices.   So that means you can't use the e2croncheck script, which is what I would recommend.  (What I'm actually doing right now is, after every crash, rebooting, logging in, and running e2croncheck right after I log in.   This allows me to notice any potential file system corruption before it gets nasty --- the problem is I'm not noticing the problem.)   E2croncheck is much more convenient, since I can be doing other things while the e2fsck is running in one terminal window.   But I suspect dm-crypt is going to make this impossible.

One thing you could do is "tune2fs -c 1 /dev/mapper/sda5_crypt".  This will force a full check after every single reboot.  This will slow down your reboot (fortunately ext4 is faster at fsck time, but unfortunately sda5 appears to be a 211 GB filesystem, and it appears to be converted from an old ext3 filesystem, so you won't see the full 10x speedup in fsck times that you would if this was a created-from-scratch ext4 file system), but if you do this while trying to find the problem, it would be very helpful.

As I said, I'm still trying to reproduce the problem on my end, but it's been hard for me to find a reproduction case.
Comment 44 Holger Freyther 2009-10-14 03:17:21 UTC
(In reply to comment #43)
> Hmm.... what were you doing right before the crash?   It looks like you were
> doing a kernel compile in /home/ich/source/linux/linux-2.6, since there were
> files with a modtime of  Tue Oct 13 16:25:55 2009.   What's funny is that
> when these files were allocated, they used blocks that were apparently
> already in use by other object files in that same source directory with a
> mod-time of Sat Oct 10 13:51:14 2009.   Did you do a "make clean" at any time
> between Saturday and Tuesday that should have deleted these files?

Not a make clean but a "rm -rf *; git checkout ." after the previous corruption in the source/linux/linux-2.6 directory.

Is it feasible to come up with a self-test patch for ext4 that compares the allocated block bitmap on disk with the bitmap in memory?

I will do an fsck on every boot now and will set up a test system in the evening to try to reproduce this (without dm/lvm in between).
Comment 45 Holger Freyther 2009-10-14 03:26:23 UTC
My idea for reproducing this would be:

$ rm -rf *
$ git checkout .
$ make


and then power cycle/reset during the make. I will report later on this.

One other question (I'm not familiar with the extX code at all): if the hypothesis is that the allocation bitmap is corrupted, where could this happen?

     - In memory bitmap not written out, journal not claiming these blocks, reassigning them during runtime?
       This would be a journal bug?

     - Block allocation at runtime forgetting to update the bit and block allocation assigning the same block again?
       This would sound like a missing lock somewhere?

     - delayed allocation picking the same block?
Comment 46 Oleksij Rempel (fishor) 2009-10-14 09:31:06 UTC
Oh... after all your comments things are making more sense. Now I see that the corruption probably happened before I found it. I'll assume it was after the last fsck, so about two or three weeks ago.

First of all I need to say that my primary testing targets are intel graphics, acpi and ath9k (wifi).  Other bugs are mostly skipped.

Like I said, this bug was mixed up with another Ubuntu Karmic bug, which made it more complicated to find out when this bug was really introduced.

The Karmic bug was introduced around alpha 2 - it will fail to boot on fsck. Normally I did "fsck /root" or "mount -o remount,ro && fsck".

With kernel 2.6.32-rc1 I needed to bisect some intel graphics issue, so I tested most versions between 2.6.31 and 2.6.32-rc1. Somewhere shortly after 2.6.32-rc1 I got an oops in ext4, so I skipped testing until shortly after 2.6.32-rc2 (this is probably the time when the real corruption happened). With ~2.6.32-rc3 I got an oops in ~scsi when I tried to fdisk a USB stick. Since I needed to make that USB stick work, I rebooted (not reset) and started with the "stable" 2.6.31-ubuntu-13. This time I got the same problem - the PC wouldn't boot and needed an fsck (it looks like the Karmic fsck issue but it isn't ... the message "bla bla /forcefsck" was new to me), and that didn't work. remount,ro didn't work either, because /etc/mtab was corrupted; cat /etc/mtab showed only random trash. I booted 2.6.31-ubuntu-13 into the initrd, mounted root manually read-only, and ran fsck -f -y /root, which returned a lot of messages like "file1 and file2 use the same sector" and "removing file3" bla bla... lost+found... After a reboot the system worked again; to find out which files were lost I ran debsums. It showed that some packages which had been installed one week before were broken. After I reinstalled the broken packages the system seems to work stably.

Now, after trying to analyse this "systematically", I noticed that all my broken laptops ran at most 2.6.32-rc1 before dying. Probably my main testing PC got corrupted around 2.6.32-rc1 time too. Now I use the e2croncheck script in cron.daily; we will see if I can confirm it.
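
(For reference, wiring the attached script into cron amounts to something like this; the source path is hypothetical:)

  sudo install -m 755 ~/e2croncheck /etc/cron.daily/e2croncheck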
Comment 47 Ric Wheeler 2009-10-14 13:21:46 UTC
Are you using the default upstream journal mode - writeback (not ordered)?
Comment 48 Theodore Tso 2009-10-14 13:36:10 UTC
Ric,

The default mode is ordered for ext4; the default mode for ext3 is writeback.  The reason for not making the change in ext4 is that with delayed allocation, we don't have the stalling problems that ext3 had.
Comment 49 Oleksij Rempel (fishor) 2009-10-14 22:08:38 UTC
I got the same issue on 2.6.32-rc4. I was working on the uvc (webcam) module, so I installed the same module over and over. At some point I accidentally unplugged the webcam while it was in use by cheese. As a result I got a kernel oops ... restarting didn't work, so I reset the machine. Blame me, I forgot to save the oops... I now have only the dmesg after restart, the fsck log, and a dd image of the root taken before the dmesg.

[    5.573939] EXT4-fs (dm-0): INFO: recovery required on readonly filesystem
[    5.573942] EXT4-fs (dm-0): write access will be enabled during recovery
[    5.628089] EXT4-fs (dm-0): orphan cleanup on readonly fs
[    5.640860] EXT4-fs warning (device dm-0): ext4_orphan_get: bad orphan inode 133701!  e2fsck was run?
[    5.640863] ext4_test_bit(bit=2628, block=524304) = -1
[    5.640864] inode=ffff880138764840
[    5.640865] is_bad_inode(inode)=0
[    5.640866] NEXT_ORPHAN(inode)=1255554482
[    5.640867] max_ino=1220608
[    5.640867] i_nlink=0
[    5.640895] EXT4-fs (dm-0): recovery complete
[    5.807823] EXT4-fs (dm-0): mounted filesystem with ordered data mode
....
[   10.673483] EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 394941
[   10.673626] EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 394941
[   10.673699] EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 394941
...

fsck has lots of:
File /lib/modules/2.6.32-rc4-00094-g80f5069-dirty/kernel/arch/x86/kernel/cpu/cpu_debug.ko (inode #393904, mod time Wed Oct 14 22:47:35 2009) 
  has 57 multiply-claimed block(s), shared with 2 file(s):
	... (inode #394934, mod time Wed Oct 14 21:53:34 2009)
	/lib/modules/2.6.32-rc4-00094-g80f5069-dirty/modules.symbols.bin (inode #394931, mod time Wed Oct 14 21:53:33 2009)
Clone multiply-claimed blocks? yes
Comment 50 Oleksij Rempel (fishor) 2009-10-14 22:09:56 UTC
Created attachment 23411 [details]
dmesg 2.6.32-rc4 host=zwerg
Comment 51 Oleksij Rempel (fishor) 2009-10-14 22:11:15 UTC
Created attachment 23412 [details]
fsck 2.6.32-rc4 host=zwerg
Comment 52 Theodore Tso 2009-10-15 04:24:49 UTC
BTW, I've been doing additional tests where I'm doing a kernel or e2fsprogs  build on my laptop, and then force a power down/crash while the build is running, and I still haven't been able to reproduce the bug on my end.   Can someone give me a hint about what else might be happening on the system when the crash happens?

Thanks!!
Comment 53 Oleksij Rempel (fishor) 2009-10-15 07:11:12 UTC
Don't worry, you'll get it.
Just take a silver bullet and shoot the beast:
echo c > /proc/sysrq-trigger

This killed my root really hard. I'm still trying to reanimate it.
Comment 54 Brian Rogers 2009-10-15 15:07:41 UTC
A few times recently I had corruption very similar to what's reported here, with Ext4 on an ordinary disk partition with no DM or anything fancy.

Personally I'm suspicious of the new boot scripts in Karmic. In trying to be as concurrent as possible, maybe it's doing something really bad like running twice and starting two concurrent instances of fsck on the same partition, or running fsck while the partition is already mounted with write access.

I can confirm an issue where it doesn't reboot after directing you to manually run a disk check and instead it tries to remount the root partition as rw, reported here:
https://bugs.launchpad.net/ubuntu/+source/mountall/+bug/452196

But that doesn't explain the disk getting corrupted in the first place.
Comment 55 Theodore Tso 2009-10-15 17:58:01 UTC
hmm.. that's an interesting suggestion.   Is anyone who is reporting problems _not_ using Karmic userspace?  

I'm using Ubuntu 9.04.... maybe that's why I can't reproduce it?
Comment 56 Jindrich Makovicka 2009-10-15 18:30:36 UTC
I encountered the same on Debian Sid, which still uses SysV init.

I also use a plain partition, no DM or LVM.
Comment 57 Mikulas Patocka 2009-10-15 19:10:35 UTC
On Tue, 13 Oct 2009, Mike Snitzer wrote:

> On Mon, Oct 12 2009 at  2:15pm -0400,
> Theodore Ts'o <tytso@mit.edu> wrote:
> 
> > Hi, I'm trying to track down a problem relating to ext4, in
> > kernel bugzilla 14354.  Originally it was thought this was a
> > 2.6.31->2.6.32-rc1 regression, but Aneesh Kumar's report shows a
> > very similar fsck transcript using a 2.6.30-1 kernel (which makes
> > it unclear whether this is really a regression or not), and all three
> > reports are tied together by the use of device-mapper.  
> > 
> > The fsck logs are consistent with barriers being disabled.  As I recall under
> > some circumstances device mapper silently drops barrier requests --- I
> > thought this was fixed for device mapper in some cases.  What's the
> > current status of barrier requests in device mapper?  Are they
> > faithfully passed for standard (non-RAID) LVM2 volumes?  When was this
> > changed?   What about dm-crypto volumes?  (One of the reports was with a
> > dm-crypt volume from what I can tell.)
> 
> Barrier infrastructure started to get added to the DM core in 2.6.30, see:
> http://git.kernel.org/linus/af7e466a1ace
> 
> But barriers were not enabled for all DM targets (_except_ dm-multipath)
> until 2.6.31.  So 2.6.31's dm-crypt does support barriers, see:
> http://git.kernel.org/linus/647c7db14ef9
> 
> If the underlying device(s) support barriers DM should faithfully pass
> them on (again except for dm-multipath).  Also, requests with barriers
> that result in -EOPNOTSUPP are retried without the barrier, see:
> http://git.kernel.org/linus/51aa32284958
> 
> Mike

Hi

Device mapper never drops data in the barrier request.

It converts a barrier to a non-barrier request, or drops a zero-sized
barrier, if the underlying device doesn't support barriers. Device mapper
doesn't support barriers for the dm-raid1 target (pending some other
upstream patch that was ignored).

If the underlying device doesn't support barriers or if the dm-raid1 target is 
being used, device mapper returns success on barriers and it simply waits 
for the request queue to drain when accepting a barrier.

The problems with "not flushing the disk cache" show up only if you pull 
the power plug out --- in this case, make sure that if you use dm-raid1, 
disk cache is turned off (hdparm -W 0). With other dm-targets, you can 
leave the disk cache on, because they accept barriers.

If you see errors under normal operations, without power loss, then it has 
probably nothing to do with barriers at all.

Mikulas
Comment 58 Theodore Tso 2009-10-15 23:40:47 UTC
Alexey, were you doing anything in particular when you did the "echo c > /proc/sysrq-trigger"?  Does it reproduce for you every single time?

I did ("echo c > /proc/sysrq-trigger") while running a "make -j4" in an e2fsprogs source tree, and after the reboot, both on the root logical volume (where I have a ccache directory on an SSD drive) and my build logical volume (where I have an e2fsprogs source and build tree) were just fine after rebooting.

As I said, I'm having a devil of a time reproducing this.  I've tried power failures (removing the battery and pulling the power cord) while the build was taking place, suspending the system and then forcing a cold restart, suspending the system and then pulling the power, using "echo c > /proc/sysrq-trigger", etc.   After each attempt, I've used e2croncheck to verify the sanity of my SSD root partition and (in case the problem is the ext4 journal replay) mounted and then unmounted /kbuild, then ran e2fsck -f /dev/closure/kbuild.
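
(Scripted, one round of that test looks roughly like this sketch; the build-tree path is illustrative, while the mount point and device are the ones mentioned above:)

  # start a build, then force an immediate crash part-way through it
  cd ~/src/e2fsprogs && make -j4 &
  sleep 30
  echo c > /proc/sysrq-trigger

  # after the reboot: mount/umount to replay the journal, then force a check
  mount /kbuild
  umount /kbuild
  e2fsck -f /dev/closure/kbuild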

In all cases, I've not had any file system inconsistencies (using 2.6.32-rc1, 2.6.32-rc3, and most recently 2.6.32-rc4).  I really need some help reproducing this!
Comment 59 Theodore Tso 2009-10-16 09:16:05 UTC
On Fri, Oct 16, 2009 at 12:28:18AM -0400, Parag Warudkar wrote:
> So I have been experimenting with various root file systems on my
> laptop running latest git. This laptop some times has problems waking
> up from sleep and that results in it needing a hard reset and
> subsequently unclean file system.

A number of people have reported this, and there is some discussion
and some suggestions that I've made here:

	http://bugzilla.kernel.org/show_bug.cgi?id=14354

It's been very frustrating because I have not been able to replicate
it myself; I've been very much looking for someone who is (a) willing
to work with me on this, and perhaps willing to risk running fsck
frequently, perhaps after every single unclean shutdown, and (b) who
can reliably reproduce this problem.  On my system, which is a T400
running 9.04 with the latest git kernels, I've not been able to
reproduce it, despite many efforts to try to reproduce it.  (i.e.,
suspend the machine and then pull the battery and power; pulling the
battery and power, "echo c > /proc/sysrq-trigger", etc., while
doing "make -j4" when the system is being uncleanly shutdown)

So if you can come up with a reliable reproduction case, and don't
mind doing some experiments and/or exchanging debugging correspondance
with me, please let me know.  I'd **really** appreciate the help.

Information that would be helpful to me would be:

a) Detailed hardware information (what type of disk/SSD, what type of
laptop, hardware configuration, etc.)

b) Detailed software information (what version of the kernel are you
using including any special patches, what distro and version are you
using, are you using LVM or dm-crypt, what partition or partitions did
you have mounted, was the failing partition a root partition or some
other mounted partition, etc.)

c) Detailed reproduction recipe (what programs were you running before
the crash/failed suspend/resume, etc.)


If you do decide to go hunting this problem, one thing I would
strongly suggest is that either to use "tune2fs -c 1 /dev/XXX" to
force a fsck after every reboot, or if you are using LVM, to use the
e2croncheck script (found as an attachment in the above bugzilla entry
or in the e2fsprogs sources in the contrib directory) to take a
snapshot and then check the snapshot right after you reboot and login
to your system.  The reported file system corruptions seem to involve
the block allocation bitmaps getting corrupted, and so you will
significantly reduce the chances of data loss if you run e2fsck as
soon as possible after the file system corruption happens.  This helps
you not lose data, and it also helps us find the bug, since it helps
pinpoint the earliest possible point where the file system is getting
corrupted.  

(I suspect that some bug reporters had their file system get corrupted
one or more boot sessions earlier, and by the time the corruption was
painfully obvious, they had lost data.  Mercifully, running fsck
frequently is much less painful on a freshly created ext4 filesystem,
and of course if you are using an SSD.)

If you can reliably reproduce the problem, it would be great to get a
bisection, or at least a confirmation that the problem doesn't exist
on 2.6.31, but does exist on 2.6.32-rcX kernels.  At this point I'm
reasonably sure it's a post-2.6.31 regression, but it would be good to
get a hard confirmation of that fact.

For people with a reliable reproduction case, one possible experiment
can be found here:

   http://bugzilla.kernel.org/show_bug.cgi?id=14354#c18

Another thing you might try is to try reverting these commits one at a
time, and see if they make the problem go away: d0646f7, 5534fb5,
7178057.  These are three commits that seem most likely, but there are
only 93 ext4-related commits, so doing a "git bisect start v2.6.31
v2.6.32-rc5 -- fs/ext4 fs/jbd2" should only take at most seven compile
tests --- assuming this is indeed a 2.6.31 regression and the problem
is an ext4-specific code change, as opposed to some other recent
change in the writeback code or some device driver which is
interacting badly with ext4.

If that assumption isn't true and so a git bisect limited to fs/ext4
and fs/jbd2 doesn't find a bad commit which when reverted makes the
problem go away, we could try a full bisection search via "git bisect
start v2.6.31 v2.6.31-rc3", which would take approximately 14 compile
tests, but hopefully that wouldn't be necessary.

I'm going to be at the kernel summit in Tokyo next week, so my e-mail
latency will be a bit longer than normal, which is one of the reason
why I've left a goodly list of potential experiments for people to
try.  If you can come up with a reliable regression, and are willing
to work with me or to try some of the above mentioned tests, I'll
definitely buy you a real (or virtual) beer.

Given that a number of people have reported losing data as a result,
it would **definitely** be a good thing to get this fixed before
2.6.32 is released.

Thanks,

						- Ted
Comment 60 Oleksij Rempel (fishor) 2009-10-16 10:48:04 UTC
Seems like I can mostly reproduce this issue on my PC. The system seems to be fine on a reset after a fresh start. Usually I lose files on the root partition, sometimes the evolution settings in home. Seems like evolution is doing something "wrong". My best reproduction case is reinstalling kernel modules (sudo make modules_install). The worst thing that happened - all modules of that kernel version were lost. To reproduce it I did "make modules_install" three times with random intervals between installs (10min-1h).
I also run e2croncheck daily, and sometimes I run it manually - I never get any errors with this script. Today I will test vanilla 2.6.31 and probably begin to bisect.
      Alexey
Comment 61 Theodore Tso 2009-10-16 14:13:34 UTC
On Fri, Oct 16, 2009 at 05:15:58AM -0400, Theodore Tso wrote:
> These are three commits that seem most likely, but there are
> only 93 ext4-related commits, so doing a "git bisect start v2.6.31
> v2.6.32-rc5 -- fs/ext4 fs/jbd2" 

One correction.  The git bisect start usage is:

     git bisect start [<bad> [<good>...]] [--] [<paths>...]

So the correct git bisect start command should be:

     git bisect start v2.6.32-rc5 v2.6.31 -- fs/ext4 fs/jbd2

And similarly, "git bisect start v2.6.31 v2.6.31-rc3" should have been:

     git bisect start v2.6.31-rc3 v2.6.31

My apologies for any confusion.

						- Ted
Comment 62 Theodore Tso 2009-10-16 14:41:45 UTC
Alexey, if reinstalling kernel modules is your best reproduction case, are you seeing actual file system corruption, or simply zero-length files?  And are you using the "sync" command after re-installing the kernel modules?

If it is actual file system corruption, try mounting the root filesystem with "auto_da_alloc=0".  That is, add the boot command-line option "rootflags=auto_da_alloc=0".   This will probably increase the likelihood of zero-length files if you don't use "sync" after reinstalling the kernel modules --- this is to be expected, since we are disabling the heuristic used to try to avoid zero-length files which aren't synced out to disk before an unclean shutdown.   The goal of this experiment is to see if maybe there's something about this heuristic which is screwing up the ability to recover the filesystem free of corruption after an unclean shutdown.
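
(In the boot loader this means appending the flag to the existing kernel line, e.g., with a hypothetical root device:)

  kernel /boot/vmlinuz-2.6.32-rc4 root=/dev/sda1 ro rootflags=auto_da_alloc=0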
Comment 63 Oleksij Rempel (fishor) 2009-10-16 15:27:54 UTC
First result with v2.6.31:
I made some scripts to make it easier to kill my PC. Now I log fsck, dmesg and md5sums every time after I kill it.
2.6.31 acts worse in this situation than 2.6.32 does. On a read-only mount the kernel tries to recover the fs (... 6 orphan inodes deleted ... recovery complete), and fsck -y -f returns no warnings (only "FILE SYSTEM WAS MODIFIED"). After reboot there are many _corrupt_ modules; the files exist but do not match (md5sum) the original files... diff <(xxd file_corrupt) <(xxd file_orig) shows about 30% difference. I expected fsck to remove the corrupted files and warn me about this.
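
(A minimal sketch of that kind of post-crash logging, run from the initrd/chroot; the log paths and checksum manifest are placeholders, and the device is the one from the logs above:)

  DEV=/dev/mapper/zwerg_buntu-root_one
  TS=$(date +%Y%m%d-%H%M%S)
  dmesg > "/var/log/crashtest/dmesg.$TS"
  fsck -y -f "$DEV" > "/var/log/crashtest/fsck.$TS" 2>&1
  # compare installed files against a known-good checksum manifest,
  # keeping only the mismatches
  md5sum -c /var/log/crashtest/known-good.md5 2>/dev/null \
      | grep -v ': OK$' > "/var/log/crashtest/md5.$TS"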

I am not using sync in my script. It would be really nice if the kernel did a sync before the PC dies.

I will now test v2.6.32-rc4 again with md5sums and see the difference. After that I will try auto_da_alloc=0.
Comment 64 Anonymous Emailer 2009-10-16 19:14:31 UTC
Reply-To: rwheeler@redhat.com

On 10/16/2009 05:15 AM, Theodore Tso wrote:
> On Fri, Oct 16, 2009 at 12:28:18AM -0400, Parag Warudkar wrote:
>    
>> So I have been experimenting with various root file systems on my
>> laptop running latest git. This laptop some times has problems waking
>> up from sleep and that results in it needing a hard reset and
>> subsequently unclean file system.
>>      
> A number of people have reported this, and there is some discussion
> and some suggestions that I've made here:
>
>       http://bugzilla.kernel.org/show_bug.cgi?id=14354
>
> It's been very frustrating because I have not been able to replicate
> it myself; I've been very much looking for someone who is (a) willing
> to work with me on this, and perhaps willing to risk running fsck
> frequently, perhaps after every single unclean shutdown, and (b) who
> can reliably reproduce this problem.  On my system, which is a T400
> running 9.04 with the latest git kernels, I've not been able to
> reproduce it, despite many efforts to try to reproduce it.  (i.e.,
> suspend the machine and then pull the battery and power; pulling the
> battery and power, "echo c>  /proc/sysrq-trigger", etc., while
> doing "make -j4" when the system is being uncleanly shutdown)
>    

I wonder if we might have better luck if we tested using an external 
(e-sata or USB connected) S-ATA drive.

Instead of pulling the drive's data connection, most of these have an 
external power source that could be turned off so the drive firmware 
won't have a chance to flush the volatile write cache. Note that some 
drives automatically write back the cache if they have power and see a 
bus disconnect, so hot unplugging just the e-sata or usb cable does not 
do the trick.

Given the number of cheap external drives, this should be easy to test 
at home....

Ric



> So if you can come up with a reliable reproduction case, and don't
> mind doing some experiments and/or exchanging debugging correspondence
> with me, please let me know.  I'd **really** appreciate the help.
>
> Information that would be helpful to me would be:
>
> a) Detailed hardware information (what type of disk/SSD, what type of
> laptop, hardware configuration, etc.)
>
> b) Detailed software information (what version of the kernel are you
> using including any special patches, what distro and version are you
> using, are you using LVM or dm-crypt, what partition or partitions did
> you have mounted, was the failing partition a root partition or some
> other mounted partition, etc.)
>
> c) Detailed reproduction recipe (what programs were you running before
> the crash/failed suspend/resume, etc.)
>
>
> If you do decide to go hunting this problem, one thing I would
> strongly suggest is that either to use "tune2fs -c 1 /dev/XXX" to
> force a fsck after every reboot, or if you are using LVM, to use the
> e2croncheck script (found as an attachment in the above bugzilla entry
> or in the e2fsprogs sources in the contrib directory) to take a
> snapshot and then check the snapshot right after you reboot and login
> to your system.  The reported file system corruptions seem to involve
> the block allocation bitmaps getting corrupted, and so you will
> significantly reduce the chances of data loss if you run e2fsck as
> soon as possible after the file system corruption happens.  This helps
> you not lose data, and it also helps us find the bug, since it helps
> pinpoint the earliest possible point where the file system is getting
> corrupted.
>
> (I suspect that some bug reporters had their file system get corrupted
> one or more boot sessions earlier, and by the time the corruption was
> painfully obvious, they had lost data.  Mercifully, running fsck
> frequently is much less painful on a freshly created ext4 filesystem,
> and of course if you are using an SSD.)
>
> If you can reliably reproduce the problem, it would be great to get a
> bisection, or at least a confirmation that the problem doesn't exist
> on 2.6.31, but does exist on 2.6.32-rcX kernels.  At this point I'm
> reasonably sure it's a post-2.6.31 regression, but it would be good to
> get a hard confirmation of that fact.
>
> For people with a reliable reproduction case, one possible experiment
> can be found here:
>
>     http://bugzilla.kernel.org/show_bug.cgi?id=14354#c18
>
> Another thing you might try is to try reverting these commits one at a
> time, and see if they make the problem go away: d0646f7, 5534fb5,
> 7178057.  These are three commits that seem most likely, but there are
> only 93 ext4-related commits, so doing a "git bisect start v2.6.31
> v2.6.32-rc5 -- fs/ext4 fs/jbd2" should only take at most seven compile
> tests --- assuming this is indeed a 2.6.31 regression and the problem
> is an ext4-specific code change, as opposed to some other recent
> change in the writeback code or some device driver which is
> interacting badly with ext4.
>
> If that assumption isn't true and so a git bisect limited to fs/ext4
> and fs/jbd2 doesn't find a bad commit which when reverted makes the
> problem go away, we could try a full bisection search via "git bisect
> start v2.6.31 v2.6.31-rc3", which would take approximately 14 compile
> tests, but hopefully that wouldn't be necessary.
>
> I'm going to be at the kernel summit in Tokyo next week, so my e-mail
> latency will be a bit longer than normal, which is one of the reason
> why I've left a goodly list of potential experiments for people to
> try.  If you can come up with a reliable regression, and are willing
> to work with me or to try some of the above mentioned tests, I'll
> definitely buy you a real (or virtual) beer.
>
> Given that a number of people have reported losing data as a result,
> it would **definitely** be a good thing to get this fixed before
> 2.6.32 is released.
>
> Thanks,
>
>                                               - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
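For anyone attempting that bisection, a minimal sketch of the workflow (using the corrected argument order from the note above: bad revision first, then good):

git bisect start v2.6.32-rc5 v2.6.31 -- fs/ext4 fs/jbd2
# build, install, and boot each candidate kernel, run the reproducer, then mark it:
git bisect bad      # corruption reproduced
git bisect good     # corruption not reproduced
git bisect reset    # restore the tree when finished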
Comment 65 Oleksij Rempel (fishor) 2009-10-16 19:39:26 UTC
I forgot to post my hardware:
hdparm -i /dev/sda
 Model=WDC, FwRev=01.01A01, SerialNo=WD-WCAU45528406
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=50
 BuffType=unknown, BuffSize=32767kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953525168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio3 pio4 
 DMA modes:  mdma0 mdma1 mdma2 
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: Unspecified:  ATA/ATAPI-1,2,3,4,5,6,7

Board intel ID45DG (45G + ICH10R, 4G ram, Core2 Duo CPU E8400).
Comment 66 Oleksij Rempel (fishor) 2009-10-16 20:05:36 UTC
Do not trust dmesg or fsck.log, use md5sum!
OK, now test results based on md5sums:
2.6.32-rc4
16 corrupt (parts of files)
120 not found (!!??)

2.6.31
7 corrupt

2.6.32-rc4 + auto_da_alloc=0
nothing lost :)

The same script was used for all tests.
Comment 67 Oleksij Rempel (fishor) 2009-10-16 20:07:00 UTC
Should I do a bisect?
Comment 68 Anonymous Emailer 2009-10-16 22:24:22 UTC
Reply-To: parag.lkml@gmail.com

On Fri, Oct 16, 2009 at 5:15 AM, Theodore Tso <tytso@mit.edu> wrote:
>
> A number of people have reported this, and there is some discussion
> and some suggestions that I've made here:
>
>        http://bugzilla.kernel.org/show_bug.cgi?id=14354

Ok, I went through this bug report and here are some immediately
useful things to note -
1) I am not running Karmic - I am running a Jaunty x86_64 clean install
2) I am not using dm/lvm or anything fancy for that matter
3) So far, I have been able to reproduce this problem by just hitting
the power button on my laptop when it is doing nothing. It also
happens when waking up from s2ram when the laptop wasn't doing anything
when it was suspended (I mean I wasn't copying/deleting stuff, wasn't
running make - the laptop was idle).

> So if you can come up with a reliable reproduction case, and don't
> mind doing some experiments and/or exchanging debugging correspondence
> with me, please let me know.  I'd **really** appreciate the help.

My laptop is a reliable test case - for me at least! I tried just now
to abruptly reset the laptop and upon reboot there was a fsck followed
by another reboot, only to have X fail to start and NetworkManager
segfault. At this point I am pretty sure I can reproduce it just by
power cycling the laptop using the power button.
After another fsck and a reboot it finally comes up.

>
> Information that would be helpful to me would be:
>
> a) Detailed hardware information (what type of disk/SSD, what type of
> laptop, hardware configuration, etc.)

SSD is a Corsair P256 latest firmware, been used on other machine
without any issues.
Laptop is HP EliteBook 8530p, 4GB RAM, Intel T9400 CPU, Intel WiFi,
ATI 3650 GPU.
No proprietary drivers ever loaded.


>
> b) Detailed software information (what version of the kernel are you
> using including any special patches, what distro and version are you
> using, are you using LVM or dm-crypt, what partition or partitions did
> you have mounted, was the failing partition a root partition or some
> other mounted partition, etc.)
It happens on current custom compiled git - both on custom and minimal
localmodconfig.
It also happens on Ubuntu daily kernel PPA builds.

>
> c) Detailed reproduction recipe (what programs were you running before
> the crash/failed suspend/resume, etc.)
>
Really nothing special - I boot to desktop, maybe open Firefox for a
few minutes and then try a reset.
fsck then reports a bunch of errors and forces reboot. On reboot X
fails to start, file system although mounted rw cannot be written to -
vim for instance won't open any file due to write errors. Another fsck
finds few more problems (or sometimes not) and reboot brings it back
to desktop.

So my problem is not corruption really but the amount and nature of
errors fsck encounters and corrects on unclean shutdown and the write
failures until another fsck -f finds more problems and reboots. None
of this happens on any other filesystem including the /boot ext3 fs on
the same disk.

>
> If you do decide to go hunting this problem, one thing I would
> strongly suggest is that either to use "tune2fs -c 1 /dev/XXX" to
> force a fsck after every reboot, or if you are using LVM, to use the
> e2croncheck script (found as an attachment in the above bugzilla entry
> or in the e2fsprogs sources in the contrib directory) to take a
> snapshot and then check the snapshot right after you reboot and login
> to your system.  The reported file system corruptions seem to involve
> the block allocation bitmaps getting corrupted, and so you will
> significantly reduce the chances of data loss if you run e2fsck as
> soon as possible after the file system corruption happens.  This helps
> you not lose data, and it also helps us find the bug, since it helps
> pinpoint the earliest possible point where the file system is getting
> corrupted.

I have enabled fsck on every mount but I am not certain ongoing "clean
state" corruption is the problem in my case.
Things have worked well without any trouble as long as I don't end up
doing an unclean shutdown.

[ snip]

> I'm going to be at the kernel summit in Tokyo next week, so my e-mail
> latency will be a bit longer than normal, which is one of the reason
> why I've left a goodly list of potential experiments for people to
> try.  If you can come up with a reliable regression, and are willing
> to work with me or to try some of the above mentioned tests, I'll
> definitely buy you a real (or virtual) beer.
>

I will try the things you mentioned - finding if this happens in pre
.32 kernels is the first one on my list followed by reverting the
specific commits you mentioned, followed if necessary by complete
bisection.

I am afraid however this is not a regression - at least not a recent
one, as I have had this experience with ext4 and unclean shutdowns
for a long time.
And that's on different hardware/different disks.

Thanks,

Parag
Comment 69 Brian Rogers 2009-10-16 23:02:32 UTC
I have an idea for anyone who gets this problem on a clean reboot. Do what it takes to reproduce the problem, then quit to single-user mode (sudo telinit 1) and try to remount root as read-only:

mount -o ro,remount /

If you can't do this because the filesystem is busy, look for and kill any process that might have a file open for writing. Remounting as read-only does not seem to be forcible, as the command appears to succeed, but you still have write access. So it's a good idea to 'touch /blah' to make sure you've turned off write access.

Plan A. If you've successfully remounted as read-only, run 'e2fsck -f' on your filesystem. If you had no errors on the filesystem before, but you do now, you know the errors were introduced during the session. If there are no errors on this check and you never get the corruption on the next boot after doing this, it's probably a problem with data not getting flushed by the time the system reboots.

Plan B. If you can't remount as read-only and can't find a process responsible for that to kill, run sync, then 'e2fsck -fn' on the partition and report the results.

I just did a dist-upgrade on my desktop machine running Karmic and the standard Karmic 2.6.31 kernel, and I was going to do the above before rebooting, but had to resort to plan B because I couldn't remount / as read-only. Even after sync, e2fsck warned that it was skipping journal recovery, and it reported errors.

Ted, is it to be expected that e2fsck would find a journal recovery necessary and see errors even if the filesystem was still mounted rw, but had been synced and no write access had taken place since the sync? I know no writing had taken place because I could run sync after the drive had spun down, and it wouldn't spin up again. Yet e2fsck continued to report the same result.

I did 'reboot -f' after this to avoid any script shenanigans and used 'init=/bin/bash' on the next boot so I could manually examine the situation. When my root fs was mounted read-only, it replayed the journal, and then there were no errors on 'e2fsck -f'.
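For reference, a condensed sketch of the two plans above (the device name is an example):

telinit 1                      # drop to single-user mode
mount -o ro,remount /
touch /blah && echo "remount did not take effect; find and kill writers"
fuser -vm /                    # list processes holding files open on /
e2fsck -f /dev/sda1            # plan A: check the read-only filesystem
sync; e2fsck -fn /dev/sda1     # plan B: if the remount never takes effect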
Comment 70 Holger Freyther 2009-10-17 02:27:10 UTC
 I have no idea if that is useful for anyone. I have done the following.

0.) use dd to create a big file
1.) install an OpenSuSE 11.1 in it using kvm
2.) change the /etc/fstab and /sbin/init
2.1) /sbin/init is doing
      mount /proc
      fsck -f /dev/hda2
      echo "press any key to continue"
      mount -oremount,rw /
      cd /source/linux-2.6
      rm -rf *
      rm -f .git/index.lock
      git checkout .
      make -j 6
      make -j 6 modules_install

3.) Use kvm to start with the disk and an external kernel: '-kernel ./bzImage -append "root=/dev/hda2"'
4.) During the "make -j 6" I interrupt the kernel compile.

Result: every ~10th time, the filesystem is left with some errors after the journal recovery (where you have to answer Y/N for fsck). The errors are lost inodes (to be attached to lost+found), zero-length inodes and such.


The benefit of using the above setup:
   - non karmic userspace (kvm still running on karmic though)
   - no dm (disk image still hosted on a dm device)
   - can copy/clone the disk image for each kernel run and use debugfs to look at it.


I will post the next fsck log with errors and then I can upload it somewhere if it is of any interest.
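A rough sketch of this harness (file names and sizes are illustrative):

dd if=/dev/zero of=disk.img bs=1M count=10240    # step 0: the big file
# steps 1-2: install the guest into disk.img, then edit its /etc/fstab and /sbin/init
kvm -hda disk.img -kernel ./bzImage -append "root=/dev/hda2"
# step 4: kill the kvm process mid-"make -j 6" to simulate the crash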
Comment 71 Oleksij Rempel (fishor) 2009-10-17 06:01:45 UTC
You don't have to edit the init script; all you need is to boot with "break=mount". On Ubuntu/Debian this will stop the boot in the initrd without a mounted root. After this you can mount it read-only or whatever you need. But you can't fsck it; the initrd does not include such tools. Continue:
mount -o ro /dev/sda1_your_root /root
mount -o bind /dev /root/dev
chroot /root /bin/bash

Now you'll have a clean mount without any other corruption caused by init script bugs.

Please can anyone place this "auto_da_alloc=0"? On my system it works around the data loss issue. Can anyone confirm this too?
Comment 72 Oleksij Rempel (fishor) 2009-10-17 06:03:19 UTC
/place "auto_da_alloc=0 to fsck/
Comment 73 Oleksij Rempel (fishor) 2009-10-17 06:12:43 UTC
oh... man... s/fsck/fstab/g

cat fstab:
/dev/sda1 /               ext4    auto_da_alloc=0,errors=remount-ro 0       1
Comment 74 Maxim Levitsky 2009-10-17 06:38:38 UTC
Created attachment 23442 [details]
kernel log

dmesg when I noticed corruption
Comment 75 Maxim Levitsky 2009-10-17 06:38:58 UTC
Created attachment 23443 [details]
fsck log
Comment 76 Maxim Levitsky 2009-10-17 06:43:47 UTC
Happened on my desktop system. It has ubuntu 9.04, ext4, and intel drivers.

I was troubleshooting a GPU hang that happened after I switched modes in an SDL game.

Same pattern: a few crashes, duplicated blocks, and now I have set fsck to run after each reboot.

And I had a crash before this, so it is very likely that the free block bitmap is indeed corrupted.
Comment 77 Oleksij Rempel (fishor) 2009-10-17 08:18:09 UTC
I tested 2.6.30: nothing in fsck.log or dmesg, but md5sum found 6 corrupted files.
I assume corrupted files are not a bug :) and will begin to bisect the lost files between 2.6.31 and 2.6.32-rc4
Comment 78 Theodore Tso 2009-10-17 10:51:41 UTC
Alexey,

There is a very big difference between _files_ being corrupted and the
_file_ _system_ being corrupted.  Your test, as I understand it, is a
"make modules_install" from a kernel source tree, followed immediately
by a forced crash of the system, correct?  Are you doing an "rm -rf
/lib/modules/2.6.32-XXXX" first, or are you just doing a "make
modules_install" and overwriting files.

In any case, if you don't do a forced sync of the filesystem, some of
the recently written files will be corrupted.  (Specifically, they may
be only partially written, or truncated to zero-length.)  This is
normal and to be expected.  If you want to make sure files are written
to stable storage, you *must* use sync or fsync(3).

This is true for pretty much any file system, by the way.  If you have
a script that looks something like this

#!/bin/sh
rm -rf /lib/modules/`uname -r`
make modules_install
echo c > /proc/sysrq-trigger

you _will_ end up with some files being missing, or not fully written
out.  Try it with ext3, xfs, btrfs, reseirfs.  All Unix filesystems
have some amount of asynchronous writes, because otherwise performance
would suck donkey gonads.  You can try to mount with -o sync, just to
see how horrible things would be.

So what do you do if you have a "precious" file --- a file where you
want to update its contents, but you want to make absolutely sure
either the old file or the new file's contents will still be present?
Well, you have to use fsync().  Well-written text editors and things
like mail transfer agents tend to get this right.  Here's one right
way of doing it:

1)  fd = open("foobar.new", O_WRONLY|O_CREAT|O_TRUNC, mode_of_foobar);
2)  /* copy acl's, extended attributes from foobar to foobar.new */
3)  write(fd, buf, bufsize); /* Write the new contents of foobar */
4)  fsync(fd);
5)  close(fd);
6)  rename("foobar.new", "foobar");

The basic idea is you write the new file, then you use fsync() to
guarantee that the contents have been written to disk, and then
finally you rename the new file on top of the old one.
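A quick shell illustration of the same update-via-replace pattern (sync(1) stands in here for fsync(), so it flushes more than strictly necessary):

printf '%s\n' "new contents" > foobar.new
sync                    # push the new data to stable storage
mv foobar.new foobar    # atomically replace the old file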

As it turns out, for a long time Linux systems were drop dead
reliable.  Unfortunately, recently with the advent of ACPI
suspend/resume, which assumed that BIOS authors were competent and
would test on OS's other than windows, and proprietry video drivers
that tend to be super unreliable, Linux systems have started crashing
more often.  Worse yet, application writers are started getting
sloppy, and would write code sequences like this when they want to
update files:

1)  fd = open("foobar", O_WRONLY|O_CREAT|O_TRUNC, default_mode);
2)  write(fd, buf, bufsize); /* write the new contents of foobar */
3)  close(fd);

Or this:

1)  fd = open("foobar.new", O_WRONLY|O_CREAT|O_TRUNC, mode_of_foobar);
2)  write(fd, buf, bufsize); /* Write the new contents of foobar */
3)  close(fd);
4)  rename("foobar.new", "foobar");

I call the first "update-via-truncate" and the second
"update-via-replace".  Because with delayed allocation, files have a
tendency to become zero-length if you update them without using
fsync() and then an errant ACPI BIOS or buggy video driver takes your
system down --- and because KDE was updating many more dot files than
necessary, and firefox was writing half a megabyte of disk files for
every single web click, people really started to notice problems.

As a result, we have heuristics that detect update-via-replace and
update-via-truncate, and if we detect this write pattern, we force a
background writeback of that file.  It's not a synchronous writeback,
since that would destroy performance, but a very small amount of time
after close(2)'ing a file descriptor that was opened with O_TRUNC
or which had been explicitly truncated down to zero using ftruncate(2)
--- i.e., update-via-truncate --- or after a rename(2) which causes
an inode to be unlinked --- i.e., update-via-replace --- the contents
of that file will be written to disk.  This is what auto_da_alloc=0
inhibits.

So why is it that you apparently had no data loss when you used
auto_da_alloc=0?  I'm guessing because the entire script's file system
activity fit within a single jbd2 transaction, and the transaction
never committed before the script forced a system crash.  (Normally a
transaction will contain five seconds of filesystem activity, unless
(a) a program calls fsync(), or (b) there's been enough file system
activity that a significant chunk of the journal space has been
consumed.)

One of the changes between 2.6.31 and 2.6.32-rc1 was a bugfix that
fixed a problem in 2.6.31 where update-via-truncate wasn't getting
detected.  This got fixed in 2.6.32-rc1, and that does change when
data gets forced out to disk.

But in any case, if it's just a matter of the file contents not
getting written to disk, that's expected if you don't use fsync() and
you crash immediately afterwards.  As I said earlier, all file systems
will tend to lose data if you crash without first using fsync().

The bug which I'm interested in replicating is one where the actual
_file_ _system_ is getting corrupted.  But if it's just a matter of
not using sync() or fsync() before a crash, that's not a bug.

    	  	    	    	     - Ted
Comment 79 Oleksij Rempel (fishor) 2009-10-17 11:37:24 UTC
Thank you for the explanation. During the bisect, fsck.log was for me like: I have broken files, WTF, fsck is not telling me anything. If by "file system corruption" you mean corrupted orphans, unattached inodes, block bitmap differences...; yes, this test clearly reproduces it. But only with md5sum can I see how big the real damage is.

Here is the script I run ($k_version is set beforehand, e.g. k_version=$(uname -r)):
make modules_install
echo "1v3 done"
sleep 5m 
make modules_install
echo "2v3 done"
sleep 5m
make modules_install
echo "3v3 done do checksum"

mount /dev/sda13 /mnt
md5deep -r /lib/modules/$k_version/* > /mnt/md5sums
umount /mnt

sleep 1m
echo c > /proc/sysrq-trigger
Comment 80 Eric Sandeen 2009-10-17 14:37:49 UTC
(taking bug temporarily just so I can edit the attachment mime types...)
Comment 81 Eric Sandeen 2009-10-17 14:40:03 UTC
(oh well, I guess even the assignee can't do that - setting text attachments as text/plain makes reading them a bit simpler...)
Comment 82 Oleksij Rempel (fishor) 2009-10-17 15:23:20 UTC
I am double-checking it now, but it seems to be this patch:

commit 91ac6f43317c0bf99969665f98016548011dfa38
Author: Frank Mayhar <fmayhar@google.com>
Date:   Wed Sep 9 22:33:47 2009 -0400

    ext4: Make non-journal fsync work properly
    
    Teach ext4_write_inode() and ext4_do_update_inode() about non-journal
    mode:  If we're not using a journal, ext4_write_inode() now calls
    ext4_do_update_inode() (after getting the iloc via ext4_get_inode_loc())
    with a new "do_sync" parameter.  If that parameter is nonzero _and_ we're
    not using a journal, ext4_do_update_inode() calls sync_dirty_buffer()
    instead of ext4_handle_dirty_metadata().
    
    This problem was found in power-fail testing, checking the amount of
    loss of files and blocks after a power failure when using fsync() and
    when not using fsync().  It turned out that using fsync() was actually
    worse than not doing so, possibly because it increased the likelihood
    that the inodes would remain unflushed and would therefore be lost at
    the power failure.
    
    Signed-off-by: Frank Mayhar <fmayhar@google.com>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Comment 83 Oleksij Rempel (fishor) 2009-10-17 17:46:11 UTC
I checked it one more time and I confirm my previous post.
Comment 84 Theodore Tso 2009-10-17 19:57:00 UTC
Alexey, how did you check it?   What happens when you build commit 91ac6f43?   Does the output of fsck show file system corruption after running your script?   And what happens if you build commit fe188c0e (aka commit 91ac6f43^ aka the parent or immediate predecessor of commit 91ac6f43)?   Does the problem reliably go away, which is to say fsck no longer reports file system corruption when you use the immediate predecessor of commit 91ac6f43, but does report it when you use commit 91ac6f43?

This commit is supposed to not change anything if you use it with a journal (which is the ext4 default).   I've reviewed the patch again, and it looks like it shouldn't be changing anything for a default ext4 filesystem with a journal present --- which is why I'm asking you exactly how you confirmed that this commit is the problem one.   Maybe I'm missing something, but I don't see how the presence or absence of this patch could be changing anything.

Thanks for all of your work, even if it's causing me to be even more mystified about what is going on.
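For concreteness, the A/B test being asked for is roughly the following (build and install commands are illustrative):

git checkout 91ac6f43 && make -j4 && make modules_install install
# run the crash script, reboot, fsck, record the result; then the parent:
git checkout 91ac6f43^ && make -j4 && make modules_install install
# repeat; the corruption should show up only with the first build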
Comment 85 Oleksij Rempel (fishor) 2009-10-17 20:54:10 UTC
Created attachment 23456 [details]
logs of all bisected kernels

Here are logs of all tested kernels; the same script from my previous post was used for all tests. Only 2.6.32-rc4-00094-g80f5069-dirty and 2.6.31 should be a bit different... for those kernels I had the first version of the script.

After the crash I booted into the initrd with the option break=mount, so we can exclude any Karmic initscript bugs. After that I mounted the rootfs read-only and chrooted into it. At this point I ran my checkfs script:
---------------------------------------------------------------------------
#!/bin/bash

current_date=$( date +%d_%m_%Y-%H_%M );
k_version=$( uname -r );

mount /dev/sda13 /mnt
cd /mnt

fsck -y -f /dev/mapper/zwerg_buntu-root_one > $k_version-$current_date.fsck
dmesg > $k_version-$current_date.dmesg
md5sum -c --quiet /mnt/md5sums > $k_version-$current_date.md5
-------------------------------------------------------------
I hope this will help you. Please check the logs for any mistakes.

If you need more info, tell me soon. My system is badly broken now; I will need to reinstall it.
Comment 86 Oleksij Rempel (fishor) 2009-10-17 20:59:07 UTC
With commit fe188c0e I get a "clean" fsck log; with 91ac6f43 I get errors like:

Free blocks count wrong for group #0 (8935, counted=8922).

Entry 'libnss_wins.so.2' in /var/spool/postfix/lib (155666) has deleted/unused inode 136151.  Clear? yes

Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks
Multiply-claimed block(s) in inode 131134: 557394
Multiply-claimed block(s) in inode 131201: 557394

... you'll see it. To be sure I double-checked these two kernels.
Comment 87 Jindrich Makovicka 2009-10-18 07:44:25 UTC
(In reply to comment #84)

Maybe this is a totally stupid suggestion, but is there any chance of this code path being executed during journal replay, when s_journal is still NULL?
Comment 88 Oleksij Rempel (fishor) 2009-10-18 10:06:25 UTC
I reverted this patch manually against v2.6.32-rc5. Now there is no fs corruption and sometimes no file corruption. The first part of the patch (ext4_do_update_inode) was already reverted; I reverted only the ext4_write_inode part.
Comment 89 Jindrich Makovicka 2009-10-18 11:23:31 UTC
Created attachment 23458 [details]
skip the non-journal branch if readonly
Comment 90 Jindrich Makovicka 2009-10-18 11:25:15 UTC
Comment on attachment 23458 [details]
skip the non-journal branch if readonly

Can you try this patch on top, if it makes any difference?
Comment 91 Oleksij Rempel (fishor) 2009-10-18 11:57:52 UTC
This patch working too, no fs corruption, no files corruption too.
Comment 92 Theodore Tso 2009-10-19 21:07:37 UTC
Created attachment 23468 [details]
Testing patch; ro-diff with a WARN_ON(1)

Could you try this patch on top of a stock 2.6.32-rc kernel?  ext4_write_inode() should not be called before the journal is set up, and I'm not able to see how this could work.  Yet if the ro-diff.patch in attachment #23458 [details] makes a difference, then that must be what's happening.

This is a variant on that patch that executes a printk and a WARN_ON(1) when the conditional triggers; that will allow me to figure out how ext4_write_inode() is getting called early.  I'm not able to replicate this on my system with this patch, but if it's working for you, then this should allow me to see what's actually happening.   If we need to simply revert the change to ext4_write_inode(), I can do that, but I'd really like to see what's going on first.   If in fact there is a patch where ext4_write_inode() can get called before ext4_load_journal(), which is what sets EXT4_SB(sb)->s_journal, then there are opportunities for other mischief/data corruption.
Comment 93 Theodore Tso 2009-10-19 21:08:27 UTC
Typo correction: "If there is a patch where ext4_write_inode() can get called" should have read "If there is a code path where ext4_write_inode() can get called..."
Comment 94 Oleksij Rempel (fishor) 2009-10-20 17:28:29 UTC
Hmm... with this patch I clearly do not have fs corruption, but I don't see "Filesystem is read-only" in dmesg o_O . Currently I can't get a dmesg with the oops; there is no COM connector and I failed to set up netconsole.. (I did it before and it worked).

Probably "msleep(1)" instead of this patch would have the same result *joke*, not tested.
Comment 95 Oleksij Rempel (fishor) 2009-10-20 20:21:01 UTC
Created attachment 23480 [details]
kernel config

How about my kernel config? I use almost the same config on all my PCs on different platforms i386/amd64. CPU - Pentium M, Atom, Core2 Duo E8400; RAM - 1.5G, 2G, 4G; i915/ich6, g945/ich7, g45/ich10r. ...

So I don't think it is about hardware... probably it is about the kernel config.

Alexey
Comment 96 Theodore Tso 2009-10-21 00:52:57 UTC
Alexey, what version of GCC are you using?   If applying the patch in attachment #23468 [details] stops file system corruption, but you're not seeing the message or the kernel stack dump that would be triggered by WARN_ON(1) in your logs, then the conditional is never triggering.   So the only thing that might be making a difference is if this is either moving some code around or some kind of compiler bug.   Something really doesn't make sense, and there is definitely something very subtle going on here.
Comment 97 Theodore Tso 2009-10-21 00:57:46 UTC
I'd appreciate by the way getting reports from other people who have been suffering file system corruption problems whether or not the patch in attachment #23468 [details] helps, and whether they are seeing any "File system is read-only" followed by a kernel stack trace in their system logs.  Thanks!!
Comment 98 Oleksij Rempel (fishor) 2009-10-21 05:55:49 UTC
I use "gcc (Ubuntu 4.4.1-4ubuntu8) 4.4.1", you can see it in dmesg.
Are there any way to disasm this function?
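Two common ways to do that, assuming the build tree (or a vmlinux with debug info) is still around:

objdump -dr fs/ext4/inode.o | sed -n '/<ext4_write_inode>:/,/^$/p'
# or, if the kernel was built with CONFIG_DEBUG_INFO:
gdb -batch -ex 'disassemble ext4_write_inode' vmlinux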
Comment 99 Oleksij Rempel (fishor) 2009-10-21 09:04:58 UTC
Created attachment 23486 [details]
disasm with patch

This is disasm of ext4_write_inode with patch
Comment 100 Oleksij Rempel (fishor) 2009-10-21 09:06:54 UTC
Created attachment 23487 [details]
disasm without patch

disasm without patch.
Comment 101 Eric Sandeen 2009-10-22 18:43:36 UTC
(In reply to comment #87)
> (In reply to comment #84)
> 
> Maybe this is a totally stupid suggestion, but is there any chance for this
> code path to be executed during journal replay, when s_journal is still NULL
> ?

If I'm understanding Alexey's test correctly, he's not even doing journal replay; he's doing IO, crashing, and then running e2fsck -f -y immediately upon reboot w/o any filesystem mount (and therefore no journal replay)

Alexey, is that correct?
Comment 102 Eric Sandeen 2009-10-22 19:06:05 UTC
Alexey, one other question, after you did this testcase for your bisect run:

mount /dev/sda13 /mnt
md5deep -r /lib/modules/$k_version/* > /mnt/md5sums
umount /mnt

sleep 1m
echo c > /proc/sysrq-trigger

did you reset power to get the machine up and running again?

Thanks,
-Eric
Comment 103 Oleksij Rempel (fishor) 2009-10-22 19:22:25 UTC
(In reply to comment #101)
> Alexy, is that correct?

No. ... crash, reboot, mount read-only, chroot into the mounted partition and fsck the mounted partition.
But it is not about what happens after the reboot: I used a "clean" kernel after the dirty crash and got the same fs corruption.
Comment 104 Oleksij Rempel (fishor) 2009-10-22 19:29:01 UTC
(In reply to comment #102)

> did you reset power to get the machine up and running again?

Yes, I reset the power; unfortunately I can't use sysrq to reboot.
Comment 105 Eric Sandeen 2009-10-22 19:37:44 UTC
(In reply to comment #103)
> (In reply to comment #101)
> > Alexy, is that correct?
> 
> No. ... crash, reboot, mount read-only, chroot into the mounted partition and
> fsck the mounted partition.

Oh, so you are e2fsck'ing a readonly mounted root partition...  Ok I had misunderstood that part, sorry.
Comment 106 Eric Sandeen 2009-10-22 21:58:08 UTC
Created attachment 23494 [details]
Alexey's x86_64 config.

well, Alexey's config seems to have done the trick (read: hosed my test box) for me.  Attached is the x86_64 version he sent me.  I got some serious multiply-claimed block errors after an apparently failed boot of that kernel.
Comment 107 Eric Sandeen 2009-10-22 22:24:47 UTC
hrm, or not.  I thought the kernel had gotten power-cycled after booting, but since it had no serial console compiled in, I didn't see that it actually apparently didn't boot at all.  Hrm.  Still, I have seen problems, so I have hope yet for narrowing down a reproducer ...
Comment 108 James Cloos 2009-10-23 04:30:36 UTC
> I'd appreciate by the way getting reports from other people who have been
> suffering file system corruption problems whether or not the patch in
> attachment #23468 [details] helps, and whether they are seeing any "File system is
> read-only" followed by a kernel stack trace in their system logs.  Thanks!!

I got some corruption, but not ro changes.

In an early rc nothing seemed wrong until, after a reboot, I saw enough
corruption in a db that I had to dump it, clean up several bogus records
and restore it.  (It was a busy but ephemeral db, so that wasn't too
much of a problem.)

Today, however, running -rc5 the box simply rebooted without notice,
and I got a number of log entries after the reboot.  Most mentioned
ext4_mb_generate_buddy, and jbd complained that it spotted a dirty
metadata buffer.  There were also a couple of bad header/extent errors.

I shut down enough to unmount /var and ran e2fsck.  That generated a
*slew* of errors, mostly complaining about multiple files claiming
the same blocks.  A second e2fsck right after (with -f) showed no
further errors.  I now have about 70 megs of data in lost+found.

The box is only partly up right now (I'm writing this over ssh);
I'll be able to provide more details after a reboot....
Comment 109 Theodore Tso 2009-10-23 07:44:59 UTC
James,

>I got some corruption, but not ro changes.

When you say corruption, you mean "file data corruption", right?  You said it was a database; was it a database that uses fsync() --- or put another way, was the database (could you tell us what database it was) one that claims to have ACID properties?

>Today, however, running -rc5 the box simply rebooted without notice.  and I
>got
>a number of log entries after the reboot...  I stut down enough to 
>unmount /var and ran e2fsck.  That generated a *slew* of errors, mostly
>complaining about multiple files claiming the same blocks.  A second e2fsck
>right after (with -f) showed no further errors.  I now have about 70 megs
>of data in lost+found.

Hmm, this sounds like the patch didn't actually help.   And am I right that you never saw the "filesystem is readonly" plus a kernel stack dump in your system logs or in dmesg?   The other thing which is interesting is that this happened on a non-root filesystem (/var), which means that journal wasn't replayed when the root filesystem was mounted read-only, but the journal was replayed by e2fsck.

Another question --- did you have your file system configured with "tune2fs -c 1", as described in comment #59?   One worry I have is that in fact the file system may have been corrupted earlier, and it simply wasn't noticed right away.   In the case of fsck complaining about blocks claimed by multiple inodes, there are two causes of this.  One is that one or more blocks in the inode table get written to the wrong place, overwriting other block(s) in the inode table.  In that case, since inode N is written on top of inode N+m*16 or N+m*32 (depending on whether you are using 256-byte or 128-byte sized inodes), and inode N+1 is written on top of inode (N+1)+(m*16) or (N+1)+(m*32), the pattern is quite easy to see from the e2fsck transcript.

The second case is one where the block allocation bitmap gets corrupted, such that some blocks which are in use are marked as free, and *then* the file system is remounted and files are written to the file system, such that the blocks are reallocated for new files.   In that case, the pattern of the multiply-claimed blocks is random, and it's likely that you will see one or more inodes where the inode is sharing blocks with more than one inodes, and where there is no relationship between the inode numbers of inodes that are using a particular block.

So far, the fsck transcripts with pass1b that people have submitted to me tend to be of the second form --- which is why I recommend the use of "tune2fs -c 1"; if the file system corruptions causing data loss are caused by corrupted block allocation bitmaps, then checking the filesystems after every single boot means that you might see pass 5 failures, but not the pass1b failures that are associated with data loss.

Obviously, we don't want to run with "tune2fs -c 1" indefinitely, since that obviously slows down boot times, but for people who are interested in helping us debug this issue, it should allow them to avoid data loss and also help us identify *when* the file system had gotten corrupted (i.e., during the previous boot session), and hopefully allow us to find the root cause to this problem.
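For reference, that per-boot check setup looks roughly like this (the device name is an example):

tune2fs -c 1 /dev/sda1                        # force a full fsck on every mount
tune2fs -l /dev/sda1 | grep -i 'mount count'  # confirm the maximum mount count
tune2fs -c -1 /dev/sda1                       # disable the check again when done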
Comment 110 Maxim Levitsky 2009-10-24 17:04:29 UTC
I got some bad news for you.

I have just finished recovering from another semi-severe corruption, this time again on my laptop.

This time:

1 - there were no crashes, for last week, and no s2ram/disk was ever attempted.
2 - tune2fs -c 1 was used, and I know for sure that fsck did run every boot
(tested without usplash, and with (had very slow boot as expected))

Yet today I did the usual 'apt-get' update, updated the -git version of NetworkManager I use, and noticed that it wouldn't connect via pptp. So to be sure I rebooted the system; fsck failed, and large corruption was there...


So it looks like bitmaps are corrupted during runtime, and crashes aren't the cause.

The rest is in the attached log
Comment 111 Maxim Levitsky 2009-10-24 17:05:30 UTC
Created attachment 23514 [details]
fsck log

This is the new log
Comment 112 Oleksij Rempel (fishor) 2009-10-25 08:58:06 UTC
I was able to reproduce fs corruption by passing barrier=0, _but_ I can't be sure barriers are disabled for my dm.

fstab says disabled:
~$ grep barr /etc/fstab
-----------------------------------------------------------------
/dev/mapper/zwerg_buntu-root_one /               ext4    errors=remount-ro,barrier=0 0       1
/dev/sda8 /home           ext4    defaults,barrier=0        0       2

-------------------------------------------------------------------

proc says disabled:
$ grep barr /proc/mounts 
----------------------------------------------------------
/dev/mapper/zwerg_buntu-root_one / ext4 rw,relatime,errors=remount-ro,barrier=0,data=ordered 0 0
/dev/sda8 /home ext4 rw,relatime,barrier=0,data=ordered 0 0
-------------------------------------------------------------

dmesg says _enabled_:
kernel 2.6.31-14-generic
~$ dmesg | grep barr
[    7.063589] EXT4-fs (dm-0): barriers enabled
[    9.365835] EXT4-fs (sda8): barriers disabled

kernel 2.6.32-rc5-00011-g2fdc246
[    9.365835] EXT4-fs (sda8): barriers disabled

I know 2.6.32-rc does not printk if barriers are enabled, so I assume they are still enabled.
I think this inconsistency between dmesg and proc should be fixed too.
Comment 113 Oleksij Rempel (fishor) 2009-10-25 13:59:12 UTC
apt-get update && apt-get upgrade && reboot is a really good way to reproduce fs corruption. After this you need to reinstall your system.
Comment 114 Oleksij Rempel (fishor) 2009-10-25 14:01:01 UTC
Is it only Debian and Ubuntu here? Any other distributions?
Comment 115 Anonymous Emailer 2009-10-25 19:04:20 UTC
Reply-To: pavel@ucw.cz

Hi!

>>> So I have been experimenting with various root file systems on my
>>> laptop running latest git. This laptop some times has problems waking
>>> up from sleep and that results in it needing a hard reset and
>>> subsequently unclean file system.
>>>      
>> A number of people have reported this, and there is some discussion
>> and some suggestions that I've made here:
>>
>>      http://bugzilla.kernel.org/show_bug.cgi?id=14354
>>
>> It's been very frustrating because I have not been able to replicate
>> it myself; I've been very much looking for someone who is (a) willing
>> to work with me on this, and perhaps willing to risk running fsck
>> frequently, perhaps after every single unclean shutdown, and (b) who
>> can reliably reproduce this problem.  On my system, which is a T400
>> running 9.04 with the latest git kernels, I've not been able to
>> reproduce it, despite many efforts to try to reproduce it.  (i.e.,
>> suspend the machine and then pull the battery and power; pulling the
>> battery and power, "echo c>  /proc/sysrq-trigger", etc., while
>> doing "make -j4" when the system is being uncleanly shutdown)
>>    
>
> I wonder if we might have better luck if we tested using an external  
> (e-sata or USB connected) S-ATA drive.
>
> Instead of pulling the drive's data connection, most of these have an  
> external power source that could be turned off so the drive firmware  
> won't have a chance to flush the volatile write cache. Note that some  
> drives automatically write back the cache if they have power and see a  
> bus disconnect, so hot unplugging just the e-sata or usb cable does not  
> do the trick.
>
> Given the number of cheap external drives, this should be easy to test  
> at home....

Do they support barriers?

(Anyway, you may want to use some kind of VM for testing. That should
> make the testing cycle shorter, easier to reproduce *and* more repeatable.)

									Pavel
Comment 116 Anonymous Emailer 2009-10-26 13:46:30 UTC
Reply-To: rwheeler@redhat.com

On 10/25/2009 02:22 AM, Pavel Machek wrote:
> Hi!
>
>>>> So I have been experimenting with various root file systems on my
>>>> laptop running latest git. This laptop some times has problems waking
>>>> up from sleep and that results in it needing a hard reset and
>>>> subsequently unclean file system.
>>>>
>>> A number of people have reported this, and there is some discussion
>>> and some suggestions that I've made here:
>>>
>>>     http://bugzilla.kernel.org/show_bug.cgi?id=14354
>>>
>>> It's been very frustrating because I have not been able to replicate
>>> it myself; I've been very much looking for someone who is (a) willing
>>> to work with me on this, and perhaps willing to risk running fsck
>>> frequently, perhaps after every single unclean shutdown, and (b) who
>>> can reliably reproduce this problem.  On my system, which is a T400
>>> running 9.04 with the latest git kernels, I've not been able to
>>> reproduce it, despite many efforts to try to reproduce it.  (i.e.,
>>> suspend the machine and then pull the battery and power; pulling the
>>> battery and power, "echo c>   /proc/sysrq-trigger", etc., while
>>> doing "make -j4" when the system is being uncleanly shutdown)
>>>
>>
>> I wonder if we might have better luck if we tested using an external
>> (e-sata or USB connected) S-ATA drive.
>>
>> Instead of pulling the drive's data connection, most of these have an
>> external power source that could be turned off so the drive firmware
>> won't have a chance to flush the volatile write cache. Note that some
>> drives automatically write back the cache if they have power and see a
>> bus disconnect, so hot unplugging just the e-sata or usb cable does not
>> do the trick.
>>
>> Given the number of cheap external drives, this should be easy to test
>> at home....
>
> Do they support barriers?
>
> (Anyway, you may want to use some kind of VM for testing. That should
> make the testing cycle shorter, easier to reproduce *and* more repeatable.)
>
>                                                                       Pavel
>

The drives themselves will support barriers - they are the same S-ATA/ATA drives 
you get normally for your desktop, etc.

I think that e-SATA would have no trouble (but fewer boxes have that external 
S-ATA port). Not sure how reliable the SCSI -> USB -> ATA conversion is for USB 
drives though (a lot of moving pieces there!).

VM testing is a good idea, but I worry that the virtual IO stack support for 
data integrity is still somewhat shaky. Christoph was working on fixing various 
bits and pieces I think...

ric
Comment 117 Linus Torvalds 2009-10-26 15:42:41 UTC
Just a note: I'm seeing this too on my new Dell laptop, with a fresh 
Fedora-12 Beta install (+ "yum update" + current -git kernel).

The root filesystem was seriously scrogged, and I had the RPM databases 
corrupted. Don't know what else is wrong, I'm rebuilding them now.

I had a few unclean shutdowns (debugging wireless driver etc), but they 
weren't during any heavy disk activity. And I'm pretty sure they weren't 
during any yum update, so the rpm database corruption smells like ext4 is 
not writing back inode information in a timely manner at all (ie any rpm 
database activity would have happened much earlier).

Example kernel messages appended.  I probably missed a few.  I do not have 
fsck logs (this was all on the root filesystem, nowhere to log them), but 
they weren't pretty. 

I can do pretty much anything to this laptop (I'm just testing it), so I'm 
open to testing.

		Linus

---
Oct 25 10:46:00 localhost kernel: device-mapper: multipath: version 1.1.0 loaded
Oct 25 10:46:00 localhost kernel: EXT4-fs (sda1): mounted filesystem with ordered data mode
Oct 25 10:46:00 localhost kernel: scsi 4:0:0:0: Direct-Access     Generic- Multi-Card       1.00 PQ: 0 ANSI: 0 CCS
Oct 25 10:46:00 localhost kernel: sd 4:0:0:0: Attached scsi generic sg1 type 0
Oct 25 10:46:00 localhost kernel: sd 4:0:0:0: [sdb] Attached SCSI removable disk
Oct 25 10:46:00 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 2: 5553 blocks in bitmap, 5546 in gd
Oct 25 10:46:00 localhost kernel: JBD: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
...
Oct 25 10:46:56 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 133089
Oct 25 10:46:56 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 133089
Oct 25 10:46:56 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 133089
Oct 25 10:46:56 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 133089
Oct 25 10:46:56 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 133089
Oct 25 10:46:56 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 133089
Oct 25 10:46:56 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 133089
...
Oct 25 10:46:59 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 17: 3663 blocks in bitmap, 3664 in gd
...
Oct 26 07:54:58 localhost kernel: EXT4-fs (sda1): mounted filesystem with ordered data mode
Oct 26 07:54:58 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 3: 4789 blocks in bitmap, 3785 in gd
Oct 26 07:54:58 localhost kernel: JBD: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
Oct 26 07:54:58 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 5832
Oct 26 07:54:58 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21482
Oct 26 07:54:58 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21485
Oct 26 07:54:58 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21486
...
Oct 26 07:55:22 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 657981
Oct 26 07:55:22 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 657981
...
Oct 26 08:11:39 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21486
Oct 26 08:11:40 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21482
Oct 26 08:11:41 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21485
Oct 26 08:12:11 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 6: 6342 blocks in bitmap, 6330 in gd
Oct 26 08:12:11 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 16: 17220 blocks in bitmap, 18932 in gd
Oct 26 08:12:11 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 26: 6769 blocks in bitmap, 4721 in gd
Oct 26 08:12:11 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 27: 15419 blocks in bitmap, 1081 in gd
Oct 26 08:12:11 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 28: 8309 blocks in bitmap, 1879 in gd
Oct 26 08:12:11 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 29: 5982 blocks in bitmap, 1880 in gd
Oct 26 08:12:11 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 30: 9476 blocks in bitmap, 3886 in gd
Oct 26 08:12:12 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21482
Oct 26 08:12:12 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21485
Oct 26 08:12:12 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21486
Oct 26 08:12:14 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21482
Oct 26 08:12:14 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21485
Oct 26 08:12:14 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21486
Comment 118 Eric Sandeen 2009-10-26 16:17:44 UTC
Linus, are you seeing the fsck & runtime errors after a clean shutdown, or only after a crash/reset/poweroff ?  Or both?  (we seem to potentially have reports of both...)

Thanks,
-Eric
Comment 119 Linus Torvalds 2009-10-26 19:13:13 UTC
It seems to be only after an unclean shutdown for me, judging by some judicious grepping of /var/log/messages. At least the ext4 error messages in dmesg only seem to happen when I don't see the line "Kernel logging (proc) stopped." in the messages file before the new boot.

Of course, maybe the problem is caused by - or even just made visible by - not the unclean shutdown itself, but the fsck or the journal replay which is just triggered by the unclean shutdown. IOW, it could be secondary damage. That sounds unlikely, though.

I dunno. As mentioned, I don't think that I was doing anything particularly write-heavy when I forced the unclean shutdown.
Comment 120 Jindrich Makovicka 2009-10-26 21:46:32 UTC
(In reply to comment #119)
> It seems to be only after an unclean shutdown for me, judging by some
> judicious
> grepping of /var/log/messgaes. At least the ext4 error messages in dmesg only
> seem to happen when I don't see the line "Kernel logging (proc) stopped." in
> the messages file before the new boot.
> 
> Of course, maybe the problem is caused by - or even just made visible by -
> not
> the unclean shutdown itself, but the fsck or the journal replay which is just
> triggered by the unclean shutdown. IOW, it could be secondary damage. That
> sounds unlikely, though.

I thought that it could be the cause, but in one case of an unclean shutdown, I rebooted into 2.6.31.1, hoping I could avoid the damage, but the fs ended up corrupted anyway.

However, I didn't run fsck on each boot, so I cannot be sure the fs has been clean before.
Comment 121 Eric Sandeen 2009-10-26 22:40:40 UTC
(In reply to comment #89)
> Created an attachment (id=23458) [details]
> skip the non-journal branch if readonly

I'm skeptical of reports that this patch fixed things as well; from code inspection and instrumented testing, I don't see how we get here w/ s_journal NULL on a journalling filesystem mount, even during recovery...
Comment 122 Aneesh Kumar K.V 2009-10-27 06:39:41 UTC
Can you test with the below patch?

Commit 487caeef9fc08c0565e082c40a8aaf58dad92bbb fixed a deadlock
between get_block and truncate by dropping the i_data_sem during
truncate. But I guess we should make sure we drop any preallocation
that we did as a part of the block allocation. Otherwise we definitely
can get multiply-claimed blocks. But this can happen only if we have
block allocation and truncate happening on the same file simultaneously.

With delayed allocation I guess it is possible that pdflush tries to do
block allocation while the application does a truncate.


diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 10539e3..14539b5 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -113,6 +113,7 @@ static int ext4_ext_truncate_extend_restart(handle_t *handle,
 	 */
 	ext4_ext_invalidate_cache(inode);
 
+	ext4_discard_preallocations(inode);
 	return err;
 }
Comment 123 Aneesh Kumar K.V 2009-10-27 10:00:39 UTC
Can you try this patch?

commit a8836b1d6f92273e001012c7705ae8f4c3d5fb65
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date:   Tue Oct 27 15:36:38 2009 +0530

    ext4: discard preallocation during truncate
    
    We need to make sure when we drop and reacquire the inode's
    i_data_sem we discard the inode preallocation. Otherwise we
    could have blocks marked as free in bitmap but still belonging
    to prealloc space.
    
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5c5bc5d..a1ef1c3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -209,6 +209,12 @@ static int try_to_extend_transaction(handle_t *handle, struct inode *inode)
 	up_write(&EXT4_I(inode)->i_data_sem);
 	ret = ext4_journal_restart(handle, blocks_for_truncate(inode));
 	down_write(&EXT4_I(inode)->i_data_sem);
+	/*
+	 * We have dropped i_data_sem. So somebody else could have done
+	 * block allocation. So discard the prealloc space created as a
+	 * part of block allocation
+	 */
+	ext4_discard_preallocations(inode);
 
 	return ret;
 }
Comment 124 Oleksij Rempel (fishor) 2009-10-27 10:38:18 UTC
I tested these two patches together; after the first crash I got fs corruption: "Block bitmap differences" and "blocks count wrong". The strange thing is that the journal patch is still working for me. I didn't get any evidence of it hitting the s_journal == NULL case, so I will continue killing my system with that patch and see if I get any fs corruption.
Comment 125 Aneesh Kumar K.V 2009-10-27 10:48:09 UTC
You need to test only with the patch in Comment #123.

Also, can you make sure you start with a clean file system
before the test, i.e. fsck and make sure the file system is clean?
Then boot the new kernel and run the test.

-aneesh
Comment 126 David Heidelberg (okias) 2009-10-27 11:47:53 UTC
*** Bug 14422 has been marked as a duplicate of this bug. ***
Comment 127 Oleksij Rempel (fishor) 2009-10-27 12:41:54 UTC
(In reply to comment #125)
My system is set up to check the fs on every boot. I confirm it was clean. Now I am testing the patch in Comment #123; the first crash was clean.. I'll do more crashes... :)
Comment 128 Oleksij Rempel (fishor) 2009-10-27 13:02:09 UTC
So I got corruption with this patch too: "i_blocks... i_size should be", "multiply claimed blocks" ...
I got the corruption on the third crash.
Comment 129 Oleksij Rempel (fishor) 2009-10-27 18:50:45 UTC
Created attachment 23550 [details]
fsck with journal corruption
Comment 130 Oleksij Rempel (fishor) 2009-10-27 19:00:53 UTC
This attachment is from my previous test used to get fs corruption (make install (3 times) && get oops && restart), with only one difference: to run fsck I mounted root with "-o ro,noload"; before I used "-o ro". Now I can see journal corruption which was hidden by the kernel before. Probably this is the cause of this issue.
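For anyone reproducing this, the difference is just the noload mount option, which skips journal replay at mount time (device and mount point are examples):

mount -o ro /dev/sda1 /root          # replays the journal, hiding the damage
mount -o ro,noload /dev/sda1 /root   # leaves the journal untouched for fsck to see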
Comment 131 Linus Torvalds 2009-10-27 20:00:51 UTC
Some more testing of this issue:

 - if I do something like

     [root@inspiron-laptop linux]# sync
     [root@inspiron-laptop linux]# echo b > /proc/sysrq-trigger

   I never seem to see any problems.

I've done the above several times - I started using it as my shutdown replacement, and I'm bisecting another independent issue on the same laptop. And no problems so far.

 - but when I did the whole power-button thing even when the system looked otherwise idle (it was hung at shutdown, and I waited ten+ seconds), I get in trouble very easily.

This makes me suspect that the background flush by bdflush simply isn't working well for ext4. Even when you'd expect things to be idle, and bdflush should hopefully get your metadata out, it just doesn't.

I'd suspect the 'wait=0' case in ext4_write_inode. It does the same 'don't do anything' that ext3 does, but presumably the delayed allocation makes for much worse trouble. Notice how just returning zero from ext4_write_inode() will make the general fs/fs-writeback.c just remove the inode from the dirty lists
entirely.

One thing to note is that the ext4_write_inode() behavior certainly isn't new, but it's entirely possible that the per-bdi writeback logic has changed timing of the writebacks a lot. Which may explain why it's apparently getting more attention in the 2.6.32-rc series, although some people seem to have seen it earlier too.
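
For reference, here is a minimal sketch of the path being pointed at above, paraphrased from the 2.6.32-era fs/ext4/inode.c (simplified, so details differ from the real function):

#include <linux/fs.h>
#include <linux/sched.h>
#include "ext4_jbd2.h"

/*
 * Sketch, not the verbatim kernel source.  The interesting part for
 * this bug is the !wait case: background writeback calls in with
 * wait == 0, we return 0 having written nothing, and fs/fs-writeback.c
 * then takes the inode off its dirty lists.
 */
int ext4_write_inode(struct inode *inode, int wait)
{
	if (current->flags & PF_MEMALLOC)
		return 0;

	if (ext4_journal_current_handle())
		return -EIO;	/* called recursively from inside a transaction */

	if (!wait)
		return 0;	/* background writeback: nothing is flushed */

	return ext4_force_commit(inode->i_sb);
}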
Comment 132 Eric Sandeen 2009-10-27 20:31:28 UTC
A few other observations from testing today.

I'm driving a test on a lab box by doing a make modules_install into /tmp/modules on the victim box, and cycling power at some random point while that runs.

2.6.32 seems to hit this reliably, with the root fs pretty well munged.  2.6.31 hasn't seen it hit yet in my testing, maybe Linus' ext4_write_inode vs. writeback theory is correct ...

Perhaps more strange, doing the same test on a non-root fs under 2.6.32 also doesn't seem to hit it reliably.  Could it be something about the remount,ro flush of the root fs on the way down?

Suspecting that possibly "mount -o ro; e2fsck -a /dev/root" during bootup was causing problems by writing to the mounted fs, I short-circuited the boot-time fsck -a; things were still badly corrupted so that doesn't seem to be it.
Comment 133 Eric Sandeen 2009-10-27 20:37:31 UTC
(In reply to comment #132)

> Perhaps more strange, doing the same test on a non-root fs under 2.6.32 also
> doesn't seem to hit it reliably.  Could it be something about the remount,ro
> flush of the root fs on the way down?

er, no, that doesn't make sense in my poweroff scenario does it!  Sigh.
Comment 134 Linus Torvalds 2009-10-27 21:23:33 UTC
On Tue, 27 Oct 2009, Eric Sandeen <sandeen@redhat.com> wrote in comment #132:
> 
> Perhaps more strange, doing the same test on a non-root fs under 2.6.32 also
> doesn't seem to hit it reliably.  Could it be something about the remount,ro
> flush of the root fs on the way down?
> 
> Suspecting that possibly "mount -o ro; e2fsck -a /dev/root" during bootup was
> causing problems by writing to the mounted fs, I short-circuited the
> boot-time
> fsck -a; things were still badly corrupted so that doesn't seem to be it.

It certainly isn't about the 'remount,ro' on the way down, since that's 
the part you avoid entirely in a non-clean shutdown.

But it could easily be something special about mounting the root 
filesystem, together with bad interaction with 'fsck'. 

Non-root filesystems will be fsck'd _before_ being mounted, but the root 
filesystem will be fsck'd _after_ the mount.

If the initial root ro-mount causes the filesystem recovery, and fsck then 
screws things up, then "root filesystem is special" might well trigger. It 
might explain why Ted and others are unable to re-create this - maybe they 
are being careful, and do ext4 testing with a non-ext4 root? 

Example issues that are exclusive to the root filesystem and would never 
be an issue on any other filesystem (exactly due to the "mount ro first, 
fsck later" behavior of root):

 - flaky in-kernel recovery code might trash more than it fixes, and would 
   never trigger for the "fsck first" case because fsck would already have 
   done it.

 - flaky user-mode fsck doesn't understand that the kernel already did 
   recovery, and re-does it.

 - virtually indexed caches might be loaded by the mount, and when you do 
   fsck later, the fsck writes back through the physically indexed direct 
   device. So the mounted root filesystem may never see those changes, 
   even after you re-mount it 'rw'.

 - even if every filesystem cache is physically indexed (ie using the 
   buffer cache rather page cache), there may be various cached values 
   that the kernel keeps around in separate caches, like the superblock 
   compatibility bits, free block counts, etc. fsck might change them, but 
   does 'remount,rw' always re-read them?

None of the four cases above are issues for a filesystem that isn't 
mounted before fsck runs.

			Linus
Comment 135 Eric Sandeen 2009-10-27 21:42:09 UTC
(In reply to comment #134)

> Example issues that are exclusive to the root filesystem and would never 
> be an issue on any other filesystem (exactly due to the "mount ro first, 
> fsck later" behavior of root):
> 
>  - flaky in-kernel recovery code might trash more than it fixes, and would 
>    never trigger for the "fsck first" case because fsck would already have 
>    done it.

Yep, but I didn't have auto-fsck in my non-root testing, and I did "mount -o ro" followed by "e2fsck -a" on it to try to "be like /", and didn't see it.  Though I'll re-check just to be sure.

>  - flaky user-mode fsck doesn't understand that the kernel already did 
>    recovery, and re-does it.

It should print out a message if it's recovering again, and we should see it in the logs... and I don't think I have.

>  - virtually indexed caches might be loaded by the mount, and when you do 
>    fsck later, the fsck writes back through the physically indexed direct 
>    device. So the mounted root filesystem may never see those changes, 
>    even after you re-mount it 'rw'.

Yes, I've been worried about this... but would this be new?

Could probably test this theory by forcing an immediate reboot if e2fsck does recovery on a readonly mounted fs.

>  - even if every filesystem cache is physically indexed (ie using the 
>    buffer cache rather page cache), there may be various cached values 
>    that the kernel keeps around in separate caches, like the superblock 
>    compatibility bits, free block counts, etc. fsck might change them, but 
>    does 'remount,rw' always re-read them?

The same test above would help determine this, I think.
Comment 136 Maxim Levitsky 2009-10-27 22:04:27 UTC
One thing to note is that my notebook has separate '/' and '/home' filesystems.
Both are ext4, and I usually see the corruption only on '/'.

However, I do still see similar corruption within /home, though much less often.
Comment 137 Andreas Dilger 2009-10-27 23:38:40 UTC
Alexey Fisher wrote:
> This attachment is from my previous test that I use to get fs corruption
> (make install (3 times) && get oops && restart), with only one difference:
> to run fsck I mount root with "-o ro,noload" instead of "-o ro". Now I can
> see journal corruption which was hidden by the kernel before.

This is not a valid test.  Mounting with "-o noload" will discard all of the
transaction information in the journal, and virtually guarantee the filesystem
is inconsistent.  It will be no better than ext2, which requires a full e2fsck
run after each crash.  This is NOT a valid reproducer for this problem.
Comment 138 Oleksij Rempel (fishor) 2009-10-28 06:44:01 UTC
(In reply to comment #137)
> This is not a valid test.  Mounting with "-o noload" will discard all of
> the transaction information in the journal, and virtually guarantee the
> filesystem is inconsistent.  It will be no better than ext2, which requires
> a full e2fsck run after each crash.  This is NOT a valid reproducer for
> this problem.

Grr... I know my English is not good, but is it really that hard to understand what I mean? "noload" is not used to reproduce the crash! To crash, I use the distribution's default options.
In this bug I do not trust the distribution to run fsck, so I do it manually. I boot into the initrd, so root is not mounted! Then I mount it manually, to be sure it is read-only. Previously I mounted it at this stage with "-o ro"; the result was that we never saw "journal corruption", because the kernel silently "repaired" it.
Now I use "-o ro,noload" to mount root and run fsck (not to reproduce the crash), and now I can see whether the journal is corrupted after a normal crash. If the journal is corrupt, the whole fs is corrupt too.
Now the question is: do we use the journal to recover the fs? If we use a broken journal, what will that recovery look like? Do we get these "multiply claimed blocks" because we get wrong information from the journal, and is this the reason why files written long before are sometimes corrupt too?
Comment 139 Brian Rogers 2009-10-28 07:20:35 UTC
I think he meant that it clears the journal when you do that. If that's the case, then you can't confirm something went wrong using your method, because for all you know the journal was correct and would have repaired everything; but now it's been cleared, and fsck won't have the chance to replay it.

But yeah, if noload just leaves the journal untouched so fsck can later have a look at it, then what you tried makes sense.
Comment 140 Maxim Levitsky 2009-10-29 16:23:30 UTC
What is the best way to reproduce this bug?

I have set up a VM and did reproduce it there; however, it reproduces very rarely... :-(
Comment 141 Eric Sandeen 2009-10-29 16:47:01 UTC
Maxim, I'm reproducing it like this:

#!/bin/bash

echo "clearing & reinstalling modules"

mkdir -p /tmp/modules
rm -rf /tmp/modules/*

mount /dev/sdb4 /mnt/test2/
cd /mnt/test2/linux-2.6
while true; do
	make -j2 MODLIB=/tmp/modules modules_install
done

where /mnt/test2/linux-2.6 is just a recent kernel tree.

I then cut power to the box in the middle of the install, reboot, and see what shape I'm in.

I had been cutting power after a random amount of time but I think it's probably best to catch it while the modules are being actively copied over, rather than say while depmod is running, but I'm not 100% certain about that.

-Eric
Comment 142 Maxim Levitsky 2009-10-29 17:34:49 UTC
Eric Sandeen, Thanks!


I tried similar approaches: I copied lots of files, I created/removed random files, I tried apt-get upgrade, many things.

Sometimes I see corruption, but very rarely...

Since I feel more and more frustrated with this VM and feel that it is a dead end, I removed it; now I'm thinking about a way to reproduce this on a real system without mass corruption.
Comment 143 Mingming Cao 2009-10-29 20:11:04 UTC
On Tue, 2009-10-27 at 15:45 +0530, Aneesh Kumar K.V wrote:
> Can you try this patch ?
> 
> commit a8836b1d6f92273e001012c7705ae8f4c3d5fb65
> Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Date:   Tue Oct 27 15:36:38 2009 +0530
> 
>     ext4: discard preallocation during truncate
>     
>     We need to make sure when we drop and reacquire the inode's
>     i_data_sem we discard the inode preallocation. Otherwise we
>     could have blocks marked as free in bitmap but still belonging
>     to prealloc space.
>     
>     Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> 

Make sense, reviewed-by: Mingming Cao <cmm@us.ibm.com>

> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5c5bc5d..a1ef1c3 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -209,6 +209,12 @@ static int try_to_extend_transaction(handle_t *handle,
> struct inode *inode)
>       up_write(&EXT4_I(inode)->i_data_sem);
>       ret = ext4_journal_restart(handle, blocks_for_truncate(inode));
>       down_write(&EXT4_I(inode)->i_data_sem);
> +     /*
> +      * We have dropped i_data_sem. So somebody else could have done
> +      * block allocation. So discard the prealloc space created as a
> +      * part of block allocation
> +      */
> +     ext4_discard_preallocations(inode);
> 
>       return ret;
>  }
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Comment 144 Anonymous Emailer 2009-10-29 21:25:20 UTC
Reply-To: parag.lkml@gmail.com

On Tue, Oct 27, 2009 at 6:15 AM, Aneesh Kumar K.V
<aneesh.kumar@linux.vnet.ibm.com> wrote:
> Can you try this patch ?
>
> commit a8836b1d6f92273e001012c7705ae8f4c3d5fb65
> Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Date:   Tue Oct 27 15:36:38 2009 +0530
>
>    ext4: discard preallocation during truncate
>
>    We need to make sure when we drop and reacquire the inode's
>    i_data_sem we discard the inode preallocation. Otherwise we
>    could have blocks marked as free in bitmap but still belonging
>    to prealloc space.

Just wanted to let you know that I have applied this patch and one
unclean shutdown later it seems to have not given me any trouble.

I will continue testing it - hopefully I won't have to reformat  this
time ( every time I tested previously I ended up having weird issues
that I decided to get rid of by reformatting /).

Parag
Comment 145 Eric Sandeen 2009-10-29 21:39:06 UTC
Parag Warudkar wrote:
> On Tue, Oct 27, 2009 at 6:15 AM, Aneesh Kumar K.V
> <aneesh.kumar@linux.vnet.ibm.com> wrote:
>> Can you try this patch ?
>>
>> commit a8836b1d6f92273e001012c7705ae8f4c3d5fb65
>> Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> Date:   Tue Oct 27 15:36:38 2009 +0530
>>
>>    ext4: discard preallocation during truncate
>>
>>    We need to make sure when we drop and reacquire the inode's
>>    i_data_sem we discard the inode preallocation. Otherwise we
>>    could have blocks marked as free in bitmap but still belonging
>>    to prealloc space.
> 
> Just wanted to let you know that I have applied this patch and one
> unclean shutdown later it seems to have not given me any trouble.
> 
> I will continue testing it - hopefully I won't have to reformat  this
> time ( every time I tested previously I ended up having weird issues
> that I decided to get rid of by reformatting /).

I've been running my testcase, and I just hit the usual corruption with 
this patch in place after 8 iterations, I'm afraid.

-Eric

> Parag
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Comment 146 Theodore Tso 2009-10-29 21:42:50 UTC
On Thu, Oct 29, 2009 at 05:25:17PM -0400, Parag Warudkar wrote:
> > Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> > Date:   Tue Oct 27 15:36:38 2009 +0530
> >
> >    ext4: discard preallocation during truncate
> >
> Just wanted to let you know that I have applied this patch and one
> unclean shutdown later it seems to have not given me any trouble.
> 
> I will continue testing it - hopefully I won't have to reformat  this
> time ( every time I tested previously I ended up having weird issues
> that I decided to get rid of by reformatting /).

Cool!

It looked like Avery Fisher had reported that the patch he had tried
didn't help him, and so I had put it in the category of "good patch,
makes sense, but probably not the one that would solve this
regression".  But if it works for you, I'll accelerate getting it to
Linus.  Do let us know if you have any additional problems with this
patch, please!

						- Ted
Comment 147 Anonymous Emailer 2009-10-29 21:52:36 UTC
Reply-To: parag.lkml@gmail.com

On Thu, Oct 29, 2009 at 5:42 PM, Theodore Tso <tytso@mit.edu> wrote:
> On Thu, Oct 29, 2009 at 05:25:17PM -0400, Parag Warudkar wrote:
>> > Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> > Date:   Tue Oct 27 15:36:38 2009 +0530
>> >
>> >    ext4: discard preallocation during truncate
>> >
>> Just wanted to let you know that I have applied this patch and one
>> unclean shutdown later it seems to have not given me any trouble.
>>
>> I will continue testing it - hopefully I won't have to reformat  this
>> time ( every time I tested previously I ended up having weird issues
>> that I decided to get rid of by reformatting /).
>
> Cool!
>
> It looked like Avery Fisher had reported that the patch he had tried
> didn't help him, and so I had put it in the category of "good patch,
> makes sense, but probably not the one that would solve this
> regression".  But if it works for you, I'll accelerate getting it to
> Linus.  Do let us know if you have any additional problems with this
> patch, please!
>

I have gone through 3 unclean shutdowns so far - this time on Ubuntu
Karmic Release, today's -git and this patch.
Nothing odd to report.

[ Goes and crashes it again - comes back clean! ]

Since literally all of my previous attempts resulted in problems,
something seems to have improved - although there is no way to tell
whether this will last, or whether it was Aneesh's patch alone that
improved things, since the last time I tried this I ended up
reinstalling, I switched to Karmic from Jaunty, and I did not test
mainline for a few days.

Any rate, I will let everyone know if I see anything different.

Thanks
Parag
Comment 148 Theodore Tso 2009-10-29 21:55:36 UTC
>>  - virtually indexed caches might be loaded by the mount, and when you do 
>>    fsck later, the fsck writes back through the physically indexed direct 
>>    device. So the mounted root filesystem may never see those changes, 
>>    even after you re-mount it 'rw'.
>
>Yes, I've been worried about this... but would this be new?

Eric, Linus,

If fsck ever modifies the filesystem, it sets bit 0 in the exit status code.   For the root file system, bit 1 is set as well.  To quote the fsck man page:

       The exit code returned by fsck is the sum of the following conditions:
            0    - No errors
            1    - File system errors corrected
            2    - System should be rebooted
            4    - File system errors left uncorrected
            8    - Operational error
            16   - Usage or syntax error
            32   - Fsck canceled by user request
            128  - Shared library error
       The exit code returned when multiple file systems are  checked  is  the
       bit-wise OR of the exit codes for each file system that is checked.

Distribution init scripts are supposed to force a reboot when the root file system is checked and fsck returns a non-zero exit code.   This should avoid the problem that Linus is afraid of.   Of course, Ubuntu Karmic *does* have a new set of fancy-schmancy init scripts that is supposed to run (some) fsck's in parallel with the rest of the boot process.  Presumably this doesn't include critical file systems such as the root and whatever filesystems might be supplying /var or /usr.  But maybe Karmic introduced a bug and the init scripts aren't forcing a reboot after fsck prints "*** REBOOT LINUX ***" and returns with an exit status code of 3 (file system errors corrected, system should be rebooted).

Note though that this shouldn't happen on a simple unclean shutdown, since the journal replay takes place before the root file system is initially mounted (even read-only).   [n.b. The fact that Avery was using -o noload is not normal, and I'm not sure what he was trying to test.]
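
To make the bit semantics concrete, here is a small illustrative check in C of what an init script's logic amounts to for the root fs (the function name and the policy are hypothetical; only the bit values come from the fsck man page quoted above):

#include <stdlib.h>
#include <sys/wait.h>

/* Illustrative only: run fsck on the root fs and test the exit bits. */
static int root_fs_needs_reboot(void)
{
	int status = system("fsck -a /dev/root");
	int code = WIFEXITED(status) ? WEXITSTATUS(status) : 8;

	/*
	 * Bit 1 ("System should be rebooted"), e.g. code == 3: errors
	 * were corrected on the mounted root, so cached kernel state may
	 * be stale -- reboot before remounting read/write.
	 */
	return code & 2;
}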
Comment 149 Theodore Tso 2009-10-29 22:20:16 UTC
Avery,

>In this bug I do not trust the distribution to run fsck, so I do it manually.
>I boot into the initrd, so root is not mounted! Then I mount it manually, to
>be sure it is read-only. Previously I mounted it at this stage with "-o ro";
>the result was that we never saw "journal corruption", because the kernel
>silently "repaired" it.

I think there's some confusion here.  If there is journal corruption, you would see a complaint in the dmesg logs after the root filesystem is mounted.  One of the changes which did happen as part of 2.6.32-rc1 is that journal checksums are enabled by default.  This is a *good* thing, since if there are journal corruptions, we find out about them.   Note that in this case, journal corruption usually means an I/O error has occurred, causing the checksum to be incorrect.

Now, one of the problems is that we don't necessarily have a good recovery path if the journal checksums don't check out and the journal replay is aborted.   In some cases, finishing the journal replay might actually leave you better off, since the corrupted block might occur later in the journal in correct form, so a full journal replay might get lucky.  In contrast, aborting the journal replay might leave more corruption for fsck to fix.

>Now I use "-o ro,noload" to mount root and run fsck (not to reproduce the
>crash), and now I can see whether the journal is corrupted after a normal
>crash. If the journal is corrupt, the whole fs is corrupt too.

OK, so "mount -o ro,noload" is not safe, and in fact the file system can be inconsistent if you don't replay the journal.  That's because after the transaction is committed, the filesystem starts staging writes to their final location to disk.  If we crash before this is complete, the file system will be inconsistent, and we have to replay the journal in order to make the file system be consistent.   This is why we force a journal replay by default, even when mounting the file system read-only.  It's the only way to make sure the file system is consistent enough that you can even *run* fsck.  (Distributions that want to be extra paranoid should store fsck and all of its needed files in the initrd, and then check the root filesystem before mounting it ro, but no distro does this as far as I know.)   (Also, in theory, we could implement a read-only mode where we don't replay the journal, but instead read the journal, figure out all of the blocks that would be replayed, and then intercept all reads to the file system such that if the block exists in the journal, we use the most recent version of the block in the journal instead of the physical location on disk.  This is far more complicated than anyone has had time/energy to write, so for now we take the cop-out route of replaying the journal even when the filesystem is mounted read-only.

If you do mount -o ro,noload, you can use fsck to clean the filesystem, but per my previous comment in this bug report, you *must* reboot in this case afterwards, since fsck modifies the mounted root filesystem to replay the journal, and there may be cached copies of the filesystem that are modified in the course of the replay of the journal by fsck.  (n.b., it's essentially the same code that is used to replay the journal, regardless of whether it is in the kernel or in e2fsck; the journal recovery.c code is kept in sync between the kernel and e2fsprogs sources.)  
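
At the syscall level, the "-o ro,noload" mount described above looks roughly like this (a hedged sketch with illustrative paths; MS_RDONLY carries the "ro", and the fs-specific data string carries "noload"):

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/*
	 * Mount read-only AND skip journal replay -- the combination
	 * that is only safe if you e2fsck and then reboot, per the
	 * explanation above.
	 */
	if (mount("/dev/sda1", "/mnt", "ext4", MS_RDONLY, "noload") != 0) {
		perror("mount");
		return 1;
	}
	return 0;
}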

>Now the question is: do we use the journal to recover the fs? If we use a
>broken journal, what will that recovery look like? Do we get these "multiply
>claimed blocks" because we get wrong information from the journal, and is
>this the reason why files written long before are sometimes corrupt too?

Again, I think there is some confusion here.   If the journal is "broken" (aka corrupted) you will see error messages from the kernel and e2fsck when it verifies the per-commit checksum and notices a checksum error.  It won't know what block or blocks in the journal were corrupted, but it will know that one of the blocks in a particular commit must have been incorrectly written, since the commit checksum doesn't match up.    In that case, it can be the cause of file system corruption --- but it's hardly the only way that file system corruption can occur.  There are many other ways it could have happened as well.

It seems unlikely to me that the "multiply claimed blocks" could be caused directly by wrong information in the journal.  As I said earlier, what is most likely is the block allocation bitmaps are getting corrupted, and then the file system was mounted read/write and new files written to it.  It's possible the bitmaps could have been corrupted by a journal replay, but that's only one of many possible ways that bitmap blocks could have gotten corrupted, and if the journal had been corrupted, there should have been error messages about journal checksums not being valid.
Comment 150 Eric Sandeen 2009-10-29 22:23:41 UTC
Created attachment 23592 [details]
console log w/ aneesh's patch

Well, now that I look again, with Aneesh's patch it didn't show the multiply-claimed blocks this time, but it did still have pretty massive corruption.  This probably needs a few more test runs to make sure that's not a fluke.

Attached is a console log of the fs errors at runtime, followed by the boot-time fsck.

-Eric
Comment 151 Theodore Tso 2009-10-30 08:16:22 UTC
On Thu, Oct 29, 2009 at 04:38:57PM -0500, Eric Sandeen wrote:
>
> I've been running my testcase, and I just hit the usual corruption with  
> this patch in place after 8 iterations, I'm afraid.

Eric, since you have a relatively controllable reproduction case, have
you tried reproducing Alexey's bisection results?  Specifically he
seemed to find that commit fe188c0e shows no problem, and commit
91ac6f43 is the first commit with problems?

Having to do multiple iterations will make doing a bisection a major
pain, but maybe we'll get something out of that.

Other things that might be worth doing given that you have a test case
would be to try reverting commit 91ac6f43, and see if that helps, and
to try this patch: http://bugzilla.kernel.org/attachment.cgi?id=23468

Or have you tried some of these experiments already?

Regards,

						- Ted
Comment 152 Oleksij Rempel (fishor) 2009-10-30 08:22:10 UTC
Ted,
Thank you for the explanation :)
Note: I'm studying computer forensics, and I was trained to mount all evidence systems with "-o ro" so as not to contaminate them. It seems ext4 breaks this tradition, so many forensic examiners will be surprised when their md5sums do not match.

Sandeen,
One more thing: I noticed that on my system, after a fresh install (with the partition reformatted too), the first corruption is harder to reproduce than on a partition where I had corruption and fixed it with fsck. So there are possibly more false results. Is the fs really fixed properly after fsck?

s/Avery/Alexey/g
Comment 153 Aneesh Kumar K.V 2009-10-30 10:26:23 UTC
Hi Linus, Eric,

There are two changes that went in after 2.6.31 that I am doubtful about.

0d34ec62e18984ac9476208660372306ef54e70d

That change makes ext4_file_write skip the journal commit if the inode doesn't have the dirty flag set. Jan Kara actually found other issues with the ext4 sync code path and has posted a patch here (MID 1256647729-29834-1-git-send-email-jack@suse.cz):
http://article.gmane.org/gmane.comp.file-systems.ext4/16126. But that should not result in multiply-claimed blocks.


487caeef9fc08c0565e082c40a8aaf58dad92bbb
This change allowed i_data_sem to be released during truncate, which implies we could allow block allocation while we are doing a truncate. I did a patch to drop the prealloc space once we reacquire i_data_sem, posted in comment 123 of this bug. But that didn't fix the corruption.

So maybe we want to test backing out 487caeef9fc08c0565e082c40a8aaf58dad92bbb, applying the patch at http://article.gmane.org/gmane.comp.file-systems.ext4/16126, and seeing whether we are able to reproduce.

-aneesh
Comment 154 Aneesh Kumar K.V 2009-10-30 10:42:38 UTC
To further update on 487caeef9fc08c0565e082c40a8aaf58dad92bbb: if we happen to create a prealloc space mapping the same offset that we are truncating, we can end up with multiply-claimed blocks. ext4_mb_return_to_preallocation, which is called during truncate, doesn't put the blocks back into the prealloc space, because it depends on us discarding the prealloc space at the beginning of the truncate.

Whether we can end up with a block allocation that gets a prealloc space mapping the same offset we are truncating is something I haven't yet figured out. We invalidate all pages before we start a truncate, which implies the flusher threads will not find any of those dirty pages. Also, for multiply-claimed blocks to happen, we would have to hand out the same block number that we are freeing via the truncate; AFAIU that cannot happen, because the bits mapping those blocks are marked used in the buddy cache, so we cannot preallocate them.

But I guess discarding the prealloc space after reacquiring i_data_sem seemed a safe bet to me. It still doesn't fully explain why we end up with multiply-claimed blocks; maybe I am missing some code paths in the above analysis.

-aneesh
Comment 155 Eric Sandeen 2009-10-30 13:54:35 UTC
Theodore Tso wrote:
> On Thu, Oct 29, 2009 at 04:38:57PM -0500, Eric Sandeen wrote:
>> I've been running my testcase, and I just hit the usual corruption with  
>> this patch in place after 8 iterations, I'm afraid.
> 
> Eric, since you have a relatively controllable reproduction case, have
> you tried reproducing Alexey's bisection results?  Specifically he
> seemed to find that commit fe188c0e shows no problem, and commit
> 91ac6f43 is the first commit with problems?

I can try it but I have very little faith in that result to be honest.

> Having to do multiple iterations will make doing a bisection a major
> pain, but maybe we'll get something out of that.

Well I've been doing bisects but I'm getting skeptical of the results; 
either my testcase isn't reliable enough or all the merges are confusing 
git-bisect (?)  Anyway it keeps ending up on nonsensical commits.

> Other things that might be worth doing given that you have a test case
> would be to try reverting commit 91ac6f43, and see if that helps, and
> to try this patch: http://bugzilla.kernel.org/attachment.cgi?id=23468
> 
> Or have you tried some of these experiments already?
> 
> Regards,
> 
>                                               - Ted

After talking to Aneesh last night, I think other good spot-checks will 
be to revert 487caeef9fc08c0565e082c40a8aaf58dad92bbb, and to test Jan's 
sync patches.

-Eric
Comment 156 Eric Sandeen 2009-10-30 16:27:16 UTC
Reverting 487caeef9fc08c0565e082c40a8aaf58dad92bbb (that i_data_sem change from Jan) on .32-rc5 still leads to significant corruption in my testing, so it doesn't seem to be the (only?) regression.  (That corruption includes many multiply-claimed blocks, FWIW.)
Comment 157 Linus Torvalds 2009-10-30 18:05:28 UTC
On Fri, 30 Oct 2009, bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> Well I've been doing bisects but I'm getting skeptical of the results; 
> either my testcase isn't reliable enough or all the merges are confusing 
> git-bisect (?)  Anyway it keeps ending up on nonsensical commits.

Merges don't confuse git-bisect (lots of merges may make bisection 
slightly less _efficient_, though, because it can be impossible to find 
commits that are nicely "in the middle"), but the way bisection works, 
even a _single_ wrong "git bisect good/bad" statement can - and will - 
make bisection results go off totally in the weeds.

Bisection is designed to be efficient, which is absolutely wonderful when 
you have a 100% unambiguous bug that needs to be found in 10,000 commits. 
It's what gives us that nice log2() behavior.

But the downside of that efficiency is the same as the downside of any 
very efficient data compression with no checksum: there is absolutely zero 
redundancy. So then if even just a single bit of information (in this case 
the "good/bad" info) is wrong, the end result will not just be wrong - it 
will be _disastrously_ wrong, and you won't even be anywhere _close_ to 
the correct end result.

[ In fact, the way bisection works, if you give a single incorrect 
  bisection point, the end result will usually be close to the commit you 
  incorrectly marked, rather than the actual buggy commit you searched 
  for.

  That _can_ be used as a hint when you're unsure - when getting a clearly 
  bogus bisection result, you can look at the bisect log, and try to find 
  the first commit in the log that is close to the clearly bogus result, 
  and you might decide to re-test that commit more carefully. And then if, 
  on re-testing, it turns out that you had been wrong the first time, you 
  might be able to avoid re-doing the whole bisection thing, and instead 
  re-do the log up to the incorrect - and now fixed - entry, and continue 
  from there. ]

That said, if you end up being uncertain about the bisection points, and 
can't give more than a "95% confidence", bisection will almost inevitably 
fail. A bug that is timing-dependent can be almost impossible to bisect. 
Even if you can reliably trigger it _eventually_, you may have to test so 
long at each bisection point that bisection becomes practically useless.

		Linus
Comment 158 Eric Sandeen 2009-10-30 19:17:33 UTC
Thanks Linus, yep I probably just got "lucky" and incorrectly marked a good run.

Or, somehow the bisect between two post-.31 commits wound up with .30 in the Makefile, and I *cough* tested the wrong kernel, not expecting the version to move backwards.

I'm running some bisects & tests in the background to try to narrow it down while we have our thinking caps on ... FWIW I can confirm that 2.6.32-rc3 fails.
Comment 159 Eric Sandeen 2009-10-30 19:22:23 UTC
Created attachment 23601 [details]
console log of 2.6.32-rc3 corruption

runtime errors & subsequent fsck of corrupted fs w/ my testcase on -rc3
Comment 160 Anonymous Emailer 2009-10-30 19:56:41 UTC
Reply-To: adilger@sun.com

On 2009-10-29, at 15:38, Eric Sandeen wrote:
> Parag Warudkar wrote:
>> On Tue, Oct 27, 2009 at 6:15 AM, Aneesh Kumar K.V
>> <aneesh.kumar@linux.vnet.ibm.com> wrote:
>>> Can you try this patch ?
>>>
>>> commit a8836b1d6f92273e001012c7705ae8f4c3d5fb65
>>> Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>>> Date:   Tue Oct 27 15:36:38 2009 +0530
>>>
>>>   ext4: discard preallocation during truncate
>>>
>>>   We need to make sure when we drop and reacquire the inode's
>>>   i_data_sem we discard the inode preallocation. Otherwise we
>>>   could have blocks marked as free in bitmap but still belonging
>>>   to prealloc space.
>> Just wanted to let you know that I have applied this patch and one
>> unclean shutdown later it seems to have not given me any trouble.
>> I will continue testing it - hopefully I won't have to reformat  this
>> time ( every time I tested previously I ended up having weird issues
>> that I decided to get rid of by reformatting /).
>
> I've been running my testcase, and I just hit the usual corruption  
> with this patch in place after 8 iterations, I'm afraid.


I wonder if there are multiple problems involved here?  Eric, it seems
possible that your reproducer is exercising a similar, though unrelated
codepath.  I would agree with Ted that if this patch is solving the
problem for some of the users it is definitely worth landing, even if it
ends up not solving all of the problems.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Comment 161 Theodore Tso 2009-10-31 09:15:45 UTC
On Fri, Oct 30, 2009 at 01:56:27PM -0600, Andreas Dilger wrote:
> I wonder if there are multiple problems involved here?  Eric, it seems
> possible that your reproducer is exercising a similar, though unrelated
> codepath.  

Note that Aneesh has published two patches which insert a call to
ext4_discard_preallocations().  One patch inserts it into the
fs/ext4/inode.c truncate path (for direct/indirect-mapped inodes) and
one inserts it into the fs/ext4/extents.c truncate path (for
extent-mapped inodes).  As near as I can tell both patches are
necessary, and it looks to me like they should be combined into a
single patch, since commit 487caeef9 affects both truncate paths.
Aneesh, do you concur?

Like Andreas, I am suspicious that there may be multiple problems
occurring here, so here is a proposed plan of attack.

Step 1) Sanity check that commit 0a80e986 shows the problem.  This is
immediately after the first batch of ext4 patches which I sent to
Linus during the post-2.6.31 merge window.  Given that commits in the
middle of this first batch have been reported by Avery as showing the
problem, and while we may have some "git bisect good" revisions that
were really bad, in general if a revision is reported bad, the problem
is probably there at that version and successive versions.  Hence, I'm
_pretty_ sure that 0a80e986 should demonstrate the problem.

Step 2) Sanity check that commit ab86e576 does _not_ show the problem.
This commit corresponds to 2.6.31-git6, and there are no ext4 patches
that I pushed before that point.  There are three commits that show up
in response to the command "git log v2.6.31..v2.6.31-git6 -- fs/ext4
fs/jbd2", but they weren't pushed by me.  Although come to think of
it, Jan Kara's commit 0d34ec62, "ext4: Remove syncing logic from
ext4_file_write" is one we might want to look at very carefully if
commit ab86e576 also shows the problem....

Step 3) Assuming that Step 1 and Step 2 are as I expect, with commit
ab86e576 being "good", and commit 0a80e986 being "bad", we will have
localized the problem commit(s) to the 63 commits that were initially
pushed to Linus during the merge window.  One of the commits is
487caeef9, which Aneesh has argued convincingly seems to be
problematic, and which seems to solve at least one or two reporter's
problems, but clearly isn't a complete solution.  So let's try to
narrow things down further by testing this branch:

    git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git test-history

This branch corresponds to commit ab86e576 (from Step 2), but with the
problematic commit 487caeef9 removed.  It was generated by applying
the following guilt patch series to v2.6.31-git6:

    git://repo.or.cz/ext4-patch-queue.git test-history

The advantage of starting with the head of test-history is that if
there are multiple problematic commits, this should show the problem
(just as reverting 487caeef9 would) --- but since 487caeef9 is
actually removed, we can now do a "git bisect start test-history
v2.6.31-git6" and hopefully be able to localize whatever additional
commits might be bad.

(We could also keep applying and unapplying the patch corresponding to
the revert of 487caeef9 while doing a bisection, but that tends to be
error prone.)

Does that sounds like a plan?

							- Ted
Comment 162 Aneesh Kumar K.V 2009-10-31 15:26:01 UTC
On Sat, Oct 31, 2009 at 05:15:28AM -0400, Theodore Tso wrote:
> On Fri, Oct 30, 2009 at 01:56:27PM -0600, Andreas Dilger wrote:
> > I wonder if there are multiple problems involved here?  Eric, it seems
> > possible that your reproducer is exercising a similar, though unrelated
> > codepath.  
> 
> Note that Aneesh has published two patches which insert a call to
> ext4_discard_preallocations().  One patch inserts it into the
> fs/ext4/inode.c truncate path (for direct/indirect-mapped inodes) and
> one inserts it into the fs/ext4/extents.c truncate path (for
> extent-mapped inodes).  As near as I can tell both patches are
> necessary, and it looks to me like they should be combined into a
> single patch, since commit 487caeef9 affects both truncate paths.
> Aneesh, do you concur?
> 

We need only the patch that drop prealloc space in ext4_truncate_restart_trans
ext4_ext_truncate_extend_restart calls ext4_truncate_restart_trans. So adding
the prealloc space dropping in ext4_truncate_restart_trans should handle both
direct/indirect-mapped inode and extent-mapped inodes.

-aneesh
Comment 163 Eric Sandeen 2009-10-31 16:16:32 UTC
(In reply to comment #161)

> Step 1) Sanity check that commit 0a80e986 shows the problem.  This is
> immediately after the first batch of ext4 patches which I sent to
> Linus during the post-2.6.31 merge window.  Given that commits in the
> middle of this first batch have been reported by Avery as showing the
> problem, and while we may have some "git bisect good" revisions that
> were really bad, in general if a revision is reported bad, the problem
> is probably there at that version and successive versions.  Hence, I'm
> _pretty_ sure that 0a80e986 should demonstrate the problem.
> 
> Step 2) Sanity check that commit ab86e576 does _not_ show the problem.
> This commit corresponds to 2.6.31-git6, and there are no ext4 patches
> that I pushed before that point.  There are three commits that show up
> in response to the command "git log v2.6.31..v2.6.31-git6 -- fs/ext4
> fs/jbd2", but they weren't pushed by me.  Although come to think of
> it, Jan Kara's commit 0d34ec62, "ext4: Remove syncing logic from
> ext4_file_write" is one we might want to look at very carefully if
> commit ab86e576 also shows the problem....

Ok, I'm testing these.

I'm really starting to think, though, that this might be a latent problem exposed by other changes; I can't otherwise explain my utter failure at trying to run git-bisect over larger ranges...

-Eric
Comment 164 Eric Sandeen 2009-10-31 19:15:31 UTC
Well, the proposed "good" point survived 20 runs.  The proposed bad point survived 14 runs, and then showed a problem.  Confidence level... well... not great yet.

I'll test the good point for more runs, to be sure.
Comment 165 Oleksij Rempel (fishor) 2009-10-31 19:56:19 UTC
Oh... yeah... now I see how bad my bisection was.
Comment 166 Eric Sandeen 2009-11-02 04:06:56 UTC
Well, this time I went for many more runs at each bisect point and came up with:

# good: [ab86e5765d41a5eb4239a1c04d613db87bea5ed8] Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
git bisect good ab86e5765d41a5eb4239a1c04d613db87bea5ed8

# bad: [0a80e9867db154966b2a771042e10452ac110e1e] ext4: replace MAX_DEFRAG_SIZE with EXT_MAX_BLOCK
git bisect bad 0a80e9867db154966b2a771042e10452ac110e1e

# good: [98a56ab382079f777e261e14512cbd4fb2107af4] ext4: Fix spelling typo in the trace format for trace_ext4_da_writepages()
# Survived 46 runs
git bisect good 98a56ab382079f777e261e14512cbd4fb2107af4

# bad: [b6a758ec3af3ec236dbfdcf6a06b84ac8f94957e] ext4: move ext4_mb_init_group() function earlier in the mballoc.c
# Failed in 20 runs
git bisect bad b6a758ec3af3ec236dbfdcf6a06b84ac8f94957e

# bad: [daea696dbac0e33af3cfe304efbfb8d74e0effe6] ext4: Remove unneeded BUG_ON() in ext4_move_extents()
# Failed in 40 runs
git bisect bad daea696dbac0e33af3cfe304efbfb8d74e0effe6

# good: [a3710fd1ee8cd542c5de63cf2c39f8912031f867] ext4: fix tracepoint format string warnings
# Survived 59 runs
git bisect good a3710fd1ee8cd542c5de63cf2c39f8912031f867

# bad: [5f3481e9a80c240f169b36ea886e2325b9aeb745] ext4: fix cache flush in ext4_sync_file
# Failed in 40 runs
git bisect bad 5f3481e9a80c240f169b36ea886e2325b9aeb745

# bad: [d0646f7b636d067d715fab52a2ba9c6f0f46b0d7] ext4: Remove journal_checksum mount option and enable it by default
# Failed in 9 runs
git bisect bad d0646f7b636d067d715fab52a2ba9c6f0f46b0d7

So it's looking like a problem with journal checksums.  I backed that patch out and am running the testcase overnight to convince myself that it survives without it.  The patch itself seems fine - it just changes the default; I'm looking at the journal checksum code to see if I can see any problem.
Comment 167 Eric Sandeen 2009-11-02 17:05:38 UTC
My test overnight ran successfully through > 100 iterations of the test, on a tree checked out just prior to d0646f7b636d067d715fab52a2ba9c6f0f46b0d7.

This morning I ran that same tree with journal checksums enabled via the mount option, saw that journal corruption was found by the checksumming code, and immediately after that we saw the corruption.  So it is the checksum feature being on that is breaking this for us.

Linus, I would recommend reverting d0646f7b636d067d715fab52a2ba9c6f0f46b0d7 for now, at this late stage in the game, and those present on the ext4 call this morning agreed.

A few things seem to have gone wrong; for one, we should have at least issued a printk when we found a bad journal checksum, but we silently continued on thanks to an RDONLY check (and the root fs is mounted readonly...).

My hand-wavy hunch about what is happening is that we're finding a bad checksum on the last partially-written transaction, which is not surprising; but if we have a wrapped log, are doing the initial scan for head/tail, and abort scanning on that bad checksum, then we are essentially running on an unrecovered filesystem.

But that's hand-wavy and I need to go look at the code.

We lived without journal checksums on by default until now, and at this point they're doing more harm than good, so we should revert the default-changing commit until we can fix it and do some good power-fail testing with the fixes in place.

I'll revert that patch and do another overnight test on an up-to-date tree to be sure nothing else snuck in, but this looks to me like the culprit, and I'm comfortable recommending that the commit be reverted for now.

Thanks,
-Eric
Comment 168 Eric Sandeen 2009-11-02 23:10:53 UTC
If the folks who saw this could revert d0646f7b636d067d715fab52a2ba9c6f0f46b0d7 (or just pull Linus' git tree, because he's reverted it upstream) and retest your scenarios, that'd be great.

-Eric
Comment 169 Linus Torvalds 2009-11-03 01:06:56 UTC
On Mon, 2 Nov 2009, Eric Sandeen <sandeen@redhat.com> wrote:
>
> if the folks who saw this could revert
> d0646f7b636d067d715fab52a2ba9c6f0f46b0d7
> (or just pull Linus' git tree, because he's reverted it upstream) and retest
> your scenarios that'd be great.

I've done a few dirty reboots on the laptop that was showing the problem 
(trying to intentionally blow things up by rebooting in the middle of a 
kernel compile), and have yet to see the kinds of problems I had 
originally.

Instead of getting problems with "lookup()" finding deleted inodes, I get 
orphan cleanup during recovery, and the filesystem looks ok. I get 
truncated files in the compile (and need to remove the bad *.o files), 
but that's what I'd _expect_ from killing the machine in the middle of a 
compile.

So I think you nailed it with the revert.

			Linus
Comment 170 Theodore Tso 2009-11-03 13:37:44 UTC
Eric, many thanks for finding the root cause for this!!   I think we still need to apply Aneesh's proposed fix, but I suspect it's responsible for a very tiny percentage of the reported problems, since that requires pdflush racing against a truncate/unlink, which should be a very rare case.
Comment 171 Ric Wheeler 2009-11-03 13:43:09 UTC
It seems that the patch that got reverted was simply the one that flipped journal checksums back to a non-default mount option. In other words, I would still expect to see this if someone mounts ext4 with the journal checksum option.

Not a blocker for now, but still something that we need to understand and fix.
Comment 172 Theodore Tso 2009-11-03 13:58:25 UTC
There's a lot more we need to understand --- including why we weren't seeing a printk message indicating a journal checksum failure, which commit was showing the checksum failure, and why we ended up seeing a checksum failure in the first place.   If it was the last commit being written right before the system crash, the commit block simply should have been missing, due to the barrier.   But we can do this without having to worry about 2.6.32 being a QA disaster for ext4.  :-)
Comment 173 Eric Sandeen 2009-11-03 14:32:26 UTC
(In reply to comment #172)
> There's a lot more we need to understand --- including why we weren't seeing
> a
> printk message indicating a journal checksum, and which commit was showing
> the
> checksum failure, and why we ended up seeing a checksum failure in the first
> place.   If it was the last commit that was being written right before the
> system crash, the commit block simply should have been missing due to the
> barrier.   But we can do this without having to worry about 2.6.32 being a QA
> disaster for ext4.  :-)

I am looking for the checksum problem today.  I agree, seeing an otherwise-valid commit block w/ a bad checksum is very troubling - barrier infrastructure problems?  Or something else, will have to see.

I already know why we're not seeing a printk: it's under an RDONLY test, and the root fs is mounted readonly.  I have a patch for that, though I wonder what the right choice is when we encounter this corruption - I'm leaning towards treating it like any other corruption and going through ext4_error...

...anyway, still looking at all this today, even w/ the .32 QA disaster averted for now.  :)
Comment 174 Eric Sandeen 2009-11-03 23:39:40 UTC
Tidbit: after turning on checksums but disabling any action on a mismatch, and adding a little instrumentation, I got something like:

checksum error 1293235, (not) breaking
normal scan finish, setting end_transaction to 1293246
(fs/jbd2/recovery.c, 388): do_one_pass: Starting recovery pass 1
(fs/jbd2/recovery.c, 388): do_one_pass: Starting recovery pass 2
(fs/jbd2/recovery.c, 257): jbd2_journal_recover: JBD: recovery, exit status 0, recovered transactions 1293215 to 1293246

IOW, the checksum error was far from the last transaction in the log; it was somewhere in the middle-ish of what needed to be replayed.

So this more or less explains the corruption (many transactions were not replayed because recovery was aborted when the bad checksum was found), and it doesn't necessarily point to a problem with the transaction in flight when the power-fail occurred... we had about 10 transactions after that.  Hrm.
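
A toy user-space model of that arithmetic (the tids are from the instrumented run above; this is not jbd2 code, just an illustration of what aborting the scan early costs):

#include <stdio.h>

int main(void)
{
	int tid_first = 1293215, tid_last = 1293246;
	int bad_tid = 1293235;		/* commit with the bad checksum */
	int end_transaction = tid_first;
	int tid;

	for (tid = tid_first; tid <= tid_last; tid++) {
		if (tid == bad_tid)	/* scan aborts at the bad checksum */
			break;
		end_transaction = tid + 1;
	}
	/* everything from end_transaction onward is never replayed */
	printf("replayed %d..%d, dropped %d..%d\n",
	       tid_first, end_transaction - 1, end_transaction, tid_last);
	return 0;
}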
Comment 175 Eric Sandeen 2009-11-03 23:43:26 UTC
Double-hrm; if it's not the transaction that was in flight during the power-fail, maybe this is a hint/clue for the other bug that claims corruption after a -clean- shutdown.  Probably time to go look for races, buffer re-use, etc. in the checksumming code...?
Comment 176 Theodore Tso 2009-11-04 03:33:25 UTC
> Double-hrm; if it's not the transaction that was in flight during the
> power-fail, maybe this is a hint/clue for the other bug that claims
> corruption after a -clean- shutdown.  Probably time to go look for races,
> buffer re-use, etc. in the checksumming code...?

I *hope* that is what it is.  The scarier alternative is that we have
a race/buffer re-use or some other bug that we've never noticed
because it's rare enough, and if it's a block that ends up getting
overwritten by a subsequent transaction, it would be a self-healing
problem....

						- Ted
Comment 177 Chris Mason 2009-11-06 20:49:25 UTC
Created attachment 23685 [details]
Horrible horrible debugging hack

I started reading this because I was afraid the writeback changes were broken, but tracking down crc errors is really interesting as well.  I started with the assumption that someone was changing the buffer after we crc it but before it hits disk, which in ext4 terms means that someone is probably changing data that has been sent to the log without properly diving into the jbd2 do_get_write_access().

Eric helped prove this is the case by testing a patch that always crcs and writes a stable duplicate of the metadata buffer instead of the metadata buffer itself (a one-liner, since the log code was already set up for this).

This attachment is a hack but the basic idea is to set the page for a buffer head readonly before we crc the page, and then set it read/write again when we're done with it.  It will only work when the pagesize and the blocksize are the same.

It probably has some false positives lurking, but so far it has found a number of places where ext4 is changing a buffer without properly calling into the jbd2 do_get_write_access() func.

Eric hit a few easy ones:

ext4_statfs():
        /* why is statfs changing the super? */
        ext4_free_blocks_count_set(es, buf->f_bfree);

ext4_setxattr_set_handle()
                memset(raw_inode, 0, EXT4_SB(inode->i_sb)->s_inode_size);

But I hit a much harder case to fix.  ext4_symlink will log the data blocks that hold the symlink name (well ext4_write_begin will log them).

ext4 uses the generic symlink read helpers, so when we read the link we dive into page_getlink(), which uses nd_terminate_link to stuff a NUL into the end of the link's name.  Since the NUL goes directly into the page cache page, and the page was logged, we're changing things that could affect the log crcs.

I think a more generic and less hacky form of this patch would be a really good thing...maybe it already exists and I just don't know about it yet ;)
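
The shape of the trick, as a hedged sketch (assuming x86's set_memory_ro()/set_memory_rw() and blocksize == PAGE_SIZE, as noted above; the actual attachment differs):

#include <linux/buffer_head.h>
#include <linux/crc32.h>
#include <linux/mm.h>
#include <asm/cacheflush.h>	/* set_memory_ro/set_memory_rw on x86 */

/*
 * Sketch only: write-protect the buffer's page in the kernel direct
 * mapping while the checksum is computed, so any writer that skipped
 * do_get_write_access() faults loudly instead of silently changing
 * data behind the crc.
 */
static u32 checksum_bh_protected(struct buffer_head *bh, u32 crc)
{
	unsigned long addr = (unsigned long)page_address(bh->b_page);

	set_memory_ro(addr, 1);			/* trap stray writers */
	crc = crc32_be(crc, (unsigned char *)bh->b_data, bh->b_size);
	set_memory_rw(addr, 1);			/* done; allow writes again */
	return crc;
}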
Comment 178 Linus Torvalds 2009-11-06 22:07:03 UTC
On Fri, 6 Nov 2009, Chris Mason <chris.mason@oracle.com> wrote:
> 
> But I hit a much harder case to fix.  ext4_symlink will log the data blocks
> that hold the symlink name (well ext4_write_begin will log them).
> 
> ext4 uses the generic symlink read helpers, so when we read the link we dive
> into page_getlink() which uses nd_terminate_link to stuff a NULL into the end
> of the link's name.

Gaah. We've had some issues with that before too. Sadly, the name lookup 
routines _really_ want the terminating NUL at the end, so the alternatives 
are to (a) allocate scratch space and copy every time you follow a symlink 
or (b) do major surgery on the name lookup code to take a 'ptr,len' 
approach to names. Both seem nasty.

For ext4, one thing I would suggest is to simply always put the NUL at the 
end of the name, even if the filesystem might not need it. Then 
nd_terminate_link() will still write the NUL, but it won't change any data 
(and thus no CRC's).

That said, I haven't looked at how nasty it would be to change the name 
lookup to use <ptr,len>. Maybe it wouldn't be as horrible as I suspect it 
is (the filesystems themselves wouldn't need changing, since the lowlevel 
filesystem name lookup routines always sees a 'struct dentry' rather than 
the pathname).

				Linus
Comment 179 Eric Sandeen 2009-11-06 22:15:16 UTC
(In reply to comment #178)

 
> For ext4, one thing I would suggest is to simply always put the NUL at the 
> end of the name, even if the filesystem might not need it. Then 
> nd_terminate_link() will still write the NUL, but it won't change any data 
> (and thus no CRC's).

Ah, good idea in the short term at least.

FWIW I've got the other 2 places I hit fixed up and my testcase is surviving thus far.  I'll find a testcase for the symlink, fix that up, and get some patches out.

Big kudos to Chris for the swanky patch to make this all obvious...

-Eric
Comment 180 Linus Torvalds 2009-11-06 22:26:13 UTC
On Fri, 6 Nov 2009, Eric Sandeen <sandeen@redhat.com> wrote:
> 
> > For ext4, one thing I would suggest is to simply always put the NUL at the 
> > end of the name, even if the filesystem might not need it. Then 
> > nd_terminate_link() will still write the NUL, but it won't change any data 
> > (and thus no CRC's).
> 
> Ah, good idea in the short term at least.

Btw, if you do that, then for Chris' debug patch you might then make 
nd_terminate_link() do a conditional write ("only write NUL if it isn't 
already NUL"), and thus avoid the debug code triggering for that case.

		Linus
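
A hedged sketch of that suggestion (not the in-tree helper, which lives in include/linux/namei.h and stores the NUL unconditionally):

#include <stddef.h>

/* Only store the NUL when it isn't already there, so a
 * write-protected page is never dirtied by a no-op store. */
static inline void nd_terminate_link(void *name, size_t len, size_t maxlen)
{
	char *s = name;
	size_t i = len < maxlen ? len : maxlen;

	if (s[i] != '\0')
		s[i] = '\0';
}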
Comment 181 Anonymous Emailer 2009-11-08 18:12:03 UTC
Reply-To: hch@infradead.org

On Fri, Nov 06, 2009 at 10:07:15PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> That said, I haven't looked at how nasty it would be to change the name 
> lookup to use <ptr,len>. Maybe it wouldn't be as horrible as I suspect it 
> is (the filesystems themselves wouldn't need changing, since the lowlevel 
> filesystem name lookup routines always sees a 'struct dentry' rather than 
> the pathname).

FYI, there are other filesystems that historically store the name
without a terminating NUL in the symlink and would benefit from that.
Comment 182 Theodore Tso 2009-11-08 19:12:34 UTC
On Fri, Nov 06, 2009 at 10:07:15PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> For ext4, one thing I would suggest is to simply always put the NUL at the 
> end of the name, even if the filesystem might not need it. Then 
> nd_terminate_link() will still write the NUL, but it won't change any data 
> (and thus no CRC's).

It's pretty simple to do this; in fact it's a one-line change in
ext4_symlink() in fs/ext4/namei.c:

		/* clear the extent format for fast symlink */
		EXT4_I(inode)->i_flags &= ~EXT4_EXTENTS_FL;
		inode->i_op = &ext4_fast_symlink_inode_operations;
-		memcpy((char *)&EXT4_I(inode)->i_data, symname, l);
+		strcpy((char *)&EXT4_I(inode)->i_data, symname);
		inode->i_size = l-1;

					- Ted
Comment 183 Linus Torvalds 2009-11-08 21:05:37 UTC
On Sun, 8 Nov 2009, bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> It's pretty simple to do this; in fact it's a one-line change in
> ext4_symlink() in fs/ext4/namei.c:
> 
>         /* clear the extent format for fast symlink */
>         EXT4_I(inode)->i_flags &= ~EXT4_EXTENTS_FL;
>         inode->i_op = &ext4_fast_symlink_inode_operations;
> -        memcpy((char *)&EXT4_I(inode)->i_data, symname, l);
> +        strcpy((char *)&EXT4_I(inode)->i_data, symname);
>         inode->i_size = l-1;

I think that's wrong.

First off, it's slower. Secondly, 'l' is already 'strlen+1', so it looks 
like ext4 _already_ copies the NUL terminating character.

So I suspect the debugging patch complained about writing a NUL to a 
location that was already NUL to begin with.

		Linus
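
A short demonstration of the point that 'l' already includes the NUL (user-space C, with illustrative names):

#include <assert.h>
#include <string.h>

int main(void)
{
	const char *symname = "target";
	size_t l = strlen(symname) + 1;	/* as in ext4_symlink(): counts the NUL */
	char i_data[60];

	memset(i_data, 0x55, sizeof(i_data));	/* poison the buffer */
	memcpy(i_data, symname, l);
	assert(i_data[l - 1] == '\0');	/* the NUL was already copied */
	return 0;
}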
Comment 184 Theodore Tso 2009-11-08 21:12:42 UTC
> First off, it's slower. Secondly, 'l' is already 'strlen+1', so it looks 
> like ext4 _already_ copies the NUL terminating character.
> 
> So I suspect the debugging patch complained about writing a NUL to a 
> location that was already NUL to begin with.

Umm, yeah.  Good point.  Since the debugging patch marks the page ro,
short of doing a test before trying to null-terminate the symlink, the
debugging patch is always going to flag a false-positive here.

	  	   	  	   	- Ted
Comment 185 Chris Mason 2009-11-09 14:54:13 UTC
Good point on the symlink code; we must be copying the NUL already.

There's another strange one I hit at mount time last week but forgot to include.  It could be noise from earlier debugging runs of the patch that left read-only pages sitting around, but I think if I squint really hard it makes sense.

During mounts and remounts, ext4 will mess around with the flags on the journal super.  I don't see exactly where the buffer is logged or prepared...most of the callers of jbd2_journal_clear_features look like they need a bigger audit to me.

Does anyone know the perf cost of forcing the page ro/rw?  I'm looking at integrating it into a CONFIG_DEBUG_CSUMS for btrfs.  For any individual page it shouldn't happen too often, but we'd be hammering on it pretty good during a commit.
Comment 186 Rafael J. Wysocki 2009-11-17 22:20:08 UTC
On Tuesday 17 November 2009, Theodore Tso wrote:
> On Mon, Nov 16, 2009 at 11:37:39PM +0100, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> > 
> > The following bug entry is on the current list of known regressions
> > from 2.6.31.  Please verify if it still should be listed and let me know
> > (either way).
> > 
> > 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=14354
> > Subject             : Bad corruption with 2.6.32-rc1 and upwards
> > Submitter   : Holger Freyther <zecke@selfish.org>
> > Date                : 2009-10-09 15:42 (39 days old)
> 
> Um, this was marked as resolved, until you reopened it and then reset
> the state to New.  Why did you do this?
> 
> It's fixed in mainline as of commit d4da6c9 when Linus reverted commit
> d0646f7.  Users could still see it if they mount a file system with -o
> journal_checksum, but (a) it's no longer the default, and (b)
> corruption if you use the non-default journal_checksum mount option is
> not a regression.
> 
> We have fixes to make journal_checksum safe queued for 2.6.33, but the
> revert fixes the regression problem.