Bug 14354
Description
Holger Freyther
2009-10-09 15:42:16 UTC
(In reply to comment #0)
> I don't know how to reproduce but I'm seeing the following issues with
> 2.6.32-rcX since the "BUG_ON with page buffers has been fixed" on a daily
> basis.
>
> - Clean shutdown
> - Reboot -> fs turns into ro mode

When it does this, you should get something in dmesg; can you attach that?

> - Reboot -> fsck. I see blocks assigned to two inodes and many lost+found of
> files touched close before the shutdown or during the shutdown. This requires
> going through the 1B, 1C, 1D passes of fsck.

please attach the actual e2fsck output.

Thanks,
-Eric

(In reply to comment #1)
> When it does this, you should get something in dmesg; can you attach that?

I will try; it is a bit difficult, since the distro does not boot up to a login prompt in this case. I will try hard.

> please attach the actual e2fsck output.

Do you have a better idea than using "fsck -y / > /boot/fsck.output" for keeping the log?

(In reply to comment #2)
> > When it does this, you should get something in dmesg; can you attach that?
>
> I will try; it is a bit difficult, since the distro does not boot up to a
> login prompt in this case. I will try hard.

If you see it on the screen, just a general idea or even a photo is fine :) (you may need to switch consoles)

> > please attach the actual e2fsck output.
>
> Do you have a better idea than using "fsck -y / > /boot/fsck.output" for
> keeping the log?

Hopefully that works fine :)

Thanks,
-eric

Can you also attach the output of dmesg, so we can see what kind of device you are using for your root filesystem, and what sort of boot messages are being emitted by the kernel? This could very well be a device driver problem. Also, what distribution are you using?

I have the same issue on three of my systems. This is not a hardware issue. I will try to reproduce it on a virtual system. Any tips on how to make that easy?

OK, I think I now know how it happens; I needed some time to play around and reproduce it. There is _no_ difference if you just boot a 2.6.32 kernel, run fsck, and then boot a 2.6.31 kernel and run fsck again - nothing will happen. Here is the way I reproduced it:

1. Crash (reset or power off) a 2.6.32-rc3 kernel.
2. Start again with 2.6.32-rc3.
3. Run fsck; it looks clean, but some programs will have lost their settings.
4. Reboot and start with a 2.6.31 kernel.
5. Run fsck; this will find that the ext4 filesystem is broken.
6. After fsck fixes it, many files are lost.

The question is: why does fsck think the partition is clean when it is used with 2.6.32, while a forced fsck on 2.6.31 will "fix" it even though there were new files on the broken blocks? I use e2fsprogs version 1.41.9-1ubuntu1.

Created attachment 23333 [details]
fsck log
This is the log I get after the steps described in my message above. This time I lost just some settings for compiz, evolution and miro ... Thank goodness I make backups.
> 1. Crash (reset or power off) a 2.6.32-rc3 kernel.
> 2. Start again with 2.6.32-rc3.
> 3. Run fsck; it looks clean, but some programs will have lost their settings.

Are you running fsck with the -f option, or not?

> 4. Reboot and start with a 2.6.31 kernel.

Is this a clean shutdown or a crash?

> 5. Run fsck; this will find that the ext4 filesystem is broken.

Again, is this an fsck -f (forced fsck), or just a normal fsck? What arguments are you using for fsck each time?

One more question: is this a completely stock 2.6.32-rcX kernel, or is it one with special patches from Ubuntu? If so, can you give me a pointer to the Ubuntu git tree and the commit ID that was used?

When the file system gets remounted read-only, can you please send the output of the "dmesg" command? If you don't have an extra file system where you can save the dmesg output, please do "dmesg | grep -i ext4" and copy down what you see. Thanks!!

Now I have tried to make this test as clean as possible:

1. Start the PC and, after some time, reset it or power it off.
2. After the restart, boot the kernel with the option break=mount; it will not mount the root fs, so you stay in the initrd.
3. Mount the root partition manually in read-only mode.
4. chroot into it.
5. fsck -y /dev/root >> fsck.log
6. fsck -y -f /dev/root >> fsck.log

On both kernels, after a crash, "fsck -y" returns:
=============================================================
fsck from util-linux-ng 2.16
/dev/mapper/zwerg_buntu-root_one: clean, 266498/1220608 files, 2444502/4882432 blocks

dmesg on mount says this (it looks like both kernels act the same way):
=============================================================
[ 32.797149] EXT3-fs: dm-0: couldn't mount because of unsupported optional features (240).
[ 32.808599] EXT4-fs (dm-0): INFO: recovery required on readonly filesystem
[ 32.809755] EXT4-fs (dm-0): write access will be enabled during recovery
[ 32.823038] EXT4-fs (dm-0): barriers enabled
[ 33.014768] kjournald2 starting: pid 1166, dev dm-0:8, commit interval 5 seconds
[ 33.014792] EXT4-fs (dm-0): delayed allocation enabled
[ 33.014794] EXT4-fs: file extents enabled
[ 33.014937] EXT4-fs: mballoc enabled
[ 33.014954] EXT4-fs (dm-0): orphan cleanup on readonly fs
[ 33.014958] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 131262
[ 33.014994] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 131238
[ 33.015004] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 131164
[ 33.015014] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 131161
[ 33.015023] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 131137
[ 33.015032] EXT4-fs (dm-0): ext4_orphan_cleanup: deleting unreferenced inode 131119
[ 33.015041] EXT4-fs (dm-0): 6 orphan inodes deleted
[ 33.015042] EXT4-fs (dm-0): recovery complete
[ 33.397102] EXT4-fs (dm-0): mounted filesystem with ordered data mode

If fsck is forced ("fsck -y -f ..."):
on the 2.6.31 kernel:
===========================================================
/dev/mapper/zwerg_buntu-root_one: clean, 266499/1220608 files, 2443934/4882432 blocks
fsck from util-linux-ng 2.16
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/mapper/zwerg_buntu-root_one: 266499/1220608 files (0.7% non-contiguous), 2443934/4882432 blocks

on 2.6.32-rcX:
===========================================================
/dev/mapper/zwerg_buntu-root_one: clean, 266474/1220608 files, 2444555/4882432 blocks
fsck from util-linux-ng 2.16
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Entry 'AB08-118C' in /media (32769) has deleted/unused inode 123461. Clear? yes
Entry 'mtab' in /var/lib/DeviceKit-disks (57430) has deleted/unused inode 393. Clear? yes
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Unattached inode 407
Connect to /lost+found? yes
Inode 407 ref count is 2, should be 1. Fix? yes
Inode 32769 ref count is 7, should be 6. Fix? yes
Pass 5: Checking group summary information
Block bitmap differences: -1606658 -1606772
Fix? yes
Free blocks count wrong for group #0 (9591, counted=9592). Fix? yes
Free blocks count wrong (2437877, counted=2437878). Fix? yes
Free inodes count wrong for group #15 (6623, counted=6624). Fix? yes
Directories count wrong for group #15 (791, counted=790). Fix? yes
Free inodes count wrong (954134, counted=954135). Fix? yes
/dev/mapper/zwerg_buntu-root_one: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/zwerg_buntu-root_one: ***** REBOOT LINUX *****
/dev/mapper/zwerg_buntu-root_one: 266473/1220608 files (0.7% non-contiguous), 2444554/4882432 blocks

I didn't manage to reproduce it on kvm-qemu. Holger, could you try running the same test and see if you get similar results?

Alexey, when you say that the dmesg was the same for both kernels, was it _identical_? Was it always the same number of orphaned inodes which were deleted, and were the inode numbers always the same? And you're sure you didn't see anything like:

EXT4-fs (dm-0): error: ext4_fill_super: journal transaction 4641626 is corrupt
EXT4-fs (dm-0): Mounting filesystem read-only

In Holger's report, he mentions that after the first reboot, the file system turns read-only. Holger, what do you see in your system logs after each reboot, in particular before the filesystem gets mounted or remounted read-only?

Yes, I had the read-only problem too, but I can't reproduce it now. I had some trouble mounting the system and fscking it, and some important files were corrupt. At that point all I wanted was to make it work again, so there is no dmesg from it. On my main PC I used debsums to find the broken files and reinstalled them. On my laptop I didn't have as much luck: the package database was corrupted, so I reformatted and reinstalled the system. I can try to go back to 2.6.32-rcX and use it until the next crash, but I need to know what to do to extract all the information that will be needed for this bug. The first good idea will probably be a remote syslog server.
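For anyone else who wants to collect the same information, the manual procedure Alexey describes above (boot with break=mount, mount the root read-only, chroot, run a normal and a forced fsck, save dmesg) can be sketched roughly as follows. The device name matches his logs, but the log destination is just an example and has to be somewhere writable (a USB stick or tmpfs), since the root filesystem itself stays read-only:

# in the initramfs shell reached with the break=mount boot option
mkdir -p /root-ro
mount -o ro /dev/mapper/zwerg_buntu-root_one /root-ro
mount -o bind /dev /root-ro/dev
chroot /root-ro /bin/bash

# inside the chroot: one normal and one forced check, plus the kernel log
fsck -y    /dev/mapper/zwerg_buntu-root_one >> /mnt/usb/fsck.log 2>&1
fsck -y -f /dev/mapper/zwerg_buntu-root_one >> /mnt/usb/fsck.log 2>&1
dmesg > /mnt/usb/dmesg.log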
I've been trying to recreate this failure mode, and haven't had any luck. It looks like you are using LVM, so a really good thing to do is to use the e2croncheck script right after you reboot and log in to the system. If it reports a clean filesystem, no worries. If it doesn't, then it would really be good to snapshot the output of dmesg to /tmp (I assume you have /tmp mounted as tmpfs) and e-mail it someplace safe; e2croncheck can also be configured to e-mail the e2fsck output someplace safe as well.

I'll note that what Holger and Alexey are seeing are somewhat different. Holger seems to be seeing a problem after a clean shutdown after re-installing glibc from a build directory. That would imply orphaned inode handling getting screwed up somehow, but I haven't been able to reproduce this on my systems. Alexey is seeing a problem after a crash/power failure, which implies some sort of issue with journal replay.

One change that we did make between 2.6.31 and 2.6.32 is that we enable journal checksums by default. In theory, if the hard drive is ignoring barriers, or is otherwise screwing up the journal I/O, we could get a journal checksum error that would abort the journal playback. That could in theory explain Alexey's symptoms, but he didn't report a journal checksum error message. So I really don't know what's going on right now.

Created attachment 23350 [details]
E2croncheck shell script
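The real script is in the attachment above (it also ships in the contrib directory of e2fsprogs). As a rough sketch of the idea, assuming an LVM volume group named "vg" with a logical volume named "root" (both placeholders), it boils down to:

#!/bin/sh
# Sketch of the e2croncheck idea: snapshot the LV, check the snapshot,
# mail the e2fsck output somewhere safe if it is not clean.
VG=vg                      # assumed volume group name
LV=root                    # assumed logical volume name
EMAIL=you@example.com      # where to send the report

lvcreate -s -L 1G -n "${LV}-check" "/dev/${VG}/${LV}"
if e2fsck -fp "/dev/${VG}/${LV}-check" > /tmp/e2fsck.out 2>&1; then
    echo "${VG}/${LV}: clean"
else
    mail -s "e2fsck found problems on ${VG}/${LV}" "$EMAIL" < /tmp/e2fsck.out
fi
lvremove -f "/dev/${VG}/${LV}-check"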
Created attachment 23353 [details]
jbd2 debugging patch
The jbd2-checksum-debugging patch adds two options which can be added to the boot line. The first, jbd2.xsum_debug=1, will print two additional kernel messages when replaying a journal that will help debug checksums. The second, jbd2.ignore_xsum=1, will cause jbd2 to ignore any checksum errors.
If you can come up with a repeatable test case, it would be useful to apply this patch and then add "jbd2.xsum_debug=1 jbd2.ignore_xsum=1" to the boot line. Then let me know (a) if the problem goes away when you add the boot command-line options, and (b) give me the dmesg output either way.
Thanks!
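For reference, on the GRUB-legacy setups that were typical for this era of Ubuntu, the two options would simply be appended to the kernel line of the boot entry; the paths and root device below are only an example:

title  Linux 2.6.32-rc3 (jbd2 checksum debugging)
root   (hd0,0)
kernel /vmlinuz-2.6.32-rc3 root=/dev/mapper/zwerg_buntu-root_one ro jbd2.xsum_debug=1 jbd2.ignore_xsum=1
initrd /initrd.img-2.6.32-rc3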
(In reply to comment #16)
> I'll note that what Holger and Alexey are seeing are somewhat different.
> Holger seems to be seeing a problem after a clean shutdown after re-installing
> glibc from a build directory.

I have been trying to reproduce it over the last two days but didn't really succeed. I'm not sure if it still makes sense to upload a dmesg/config to describe my system. Basically I'm running Ubuntu Karmic (one thing in common) with a self-built kernel (plain kernel.org) and I'm using a crypto root (so he and I both have device mapper in use of some sort). I will keep trying to reproduce this. Thanks for taking a look.
Boot + mounting FS roo
This was reached after two crashes/screen blanks on 2.6.32-rc3 (something bad with KMS and the intel driver) and two forced resets. The filesystem got mounted read-only during a make -j 8 (using icecc/icecream to distribute compile jobs) in my WebKit build tree and a git pull --rebase in the linux-2.6 tree.
Created attachment 23357 [details]
Log message on restart after journal abort
That is from the reboot (using an older 2.6.32-rc2 kernel due to intel driver issues on the console). The kernel is from shortly after the BUG_ON fix on the page buffers (is that enough information?).
Created attachment 23358 [details]
fsck.log with blocks assigned to multiple inodes.
So the fs assigned recent blocks both to my WebKit build and to the git pull --rebase in the kernel tree.
When you do these "forced resets," are they power cycles? This one was found on 2.6.30-1 reported by one of my friend Hi Aneesh, I got the second ext4 fs error on my wife's eeepc. I have recorded what was fixed. The kernel is 2.6.30-1 from debian sid. suvaipc:~# fsck.ext4 /dev/mapper/eeepc-suvai e2fsck 1.41.3 (12-Oct-2008) /dev/mapper/eeepc-suvai contains a file system with errors, check forced. Pass 1: Checking inodes, blocks, and sizes Inode 336893 has illegal block(s). Clear<y>? yes Illegal block #12 (7602287) in inode 336893. CLEARED. Illegal block #13 (6422560) in inode 336893. CLEARED. Illegal block #15 (7274595) in inode 336893. CLEARED. Illegal block #16 (7536750) in inode 336893. CLEARED. Illegal block #17 (6553705) in inode 336893. CLEARED. Illegal block #18 (7471205) in inode 336893. CLEARED. Illegal block #19 (6553701) in inode 336893. CLEARED. Illegal block #20 (6357024) in inode 336893. CLEARED. Illegal block #22 (7602281) in inode 336893. CLEARED. Illegal block #23 (7798816) in inode 336893. CLEARED. Illegal block #24 (7536737) in inode 336893. CLEARED. Too many illegal blocks in inode 336893. Clear inode<y>? yes Restarting e2fsck from the beginning... /dev/mapper/eeepc-suvai contains a file system with errors, check forced. Pass 1: Checking inodes, blocks, and sizes Running additional passes to resolve blocks claimed by more than one inode... Pass 1B: Rescanning for multiply-claimed blocks Multiply-claimed block(s) in inode 336889: 1347590 1347591 1347592 1347593 1347595 1347597 Multiply-claimed block(s) in inode 336890: 1898439 1898442 1898443 1898444 1898445 1898446 1898513 1898514 1898515 1898516 1898517 1898518 1898519 1898520 1898521 1898522 1898543 1898544 1898545 Multiply-claimed block(s) in inode 336891: 1361951 1361952 1361961 1361962 1361963 1361964 1361965 1361966 1361967 1361968 1361969 1361970 1361971 1361972 1361973 1361974 1361975 1361976 1361977 1361978 Multiply-claimed block(s) in inode 336892: 1415316 1415317 1415318 1415319 1415320 1415321 1415344 1415345 Multiply-claimed block(s) in inode 336894: 1415359 1415382 1416689 Multiply-claimed block(s) in inode 336895: 1898381 1898382 1898383 1898384 1898385 1898386 1898387 1898388 1898389 1898390 1898391 1898392 1898393 1898394 1898395 1898396 1898397 1898398 1898399 1898400 1898401 1898402 1898403 1898404 1898405 1898406 1898407 1898408 1898409 1898410 1898411 1898412 1898413 1898414 1898415 1898416 1898417 1898418 1898419 1898420 1898421 1898422 1898423 1898424 1898425 1898426 1898427 1898428 1898429 1898430 1898431 1898432 1898433 1898434 1898435 1898436 1898437 1898438 Multiply-claimed block(s) in inode 336896: 1361922 1361923 1361924 1361925 1361926 1361927 1361929 1361930 1361931 1361932 1361933 1361934 1361935 1361947 1361948 1361949 1361950 1361979 1361980 Multiply-claimed block(s) in inode 336897: 1347590 1347591 1347592 1347593 1347595 1347597 Multiply-claimed block(s) in inode 336898: 1898439 1898442 1898443 1898444 1898445 1898446 1898513 1898514 1898515 1898516 1898517 1898518 1898519 1898520 1898521 1898522 1898543 1898544 1898545 Multiply-claimed block(s) in inode 336899: 1361951 1361952 1361961 1361962 1361963 1361964 1361965 1361966 1361967 1361968 1361969 1361970 1361971 1361972 1361973 1361974 1361975 1361976 1361977 1361978 Multiply-claimed block(s) in inode 336900: 1415316 1415317 1415318 1415319 1415320 1415321 1415344 1415345 Multiply-claimed block(s) in inode 336902: 1415359 1415382 1416689 Multiply-claimed block(s) in inode 336903: 1898381 1898382 1898383 1898384 1898385 1898386 
1898387 1898388 1898389 1898390 1898391 1898392 1898393 1898394 1898395 1898396 1898397 1898398 1898399 1898400 1898401 1898402 1898403 1898404 1898405 1898406 1898407 1898408 1898409 1898410 1898411 1898412 1898413 1898414 1898415 1898416 1898417 1898418 1898419 1898420 1898421 1898422 1898423 1898424 1898425 1898426 1898427 1898428 1898429 1898430 1898431 1898432 1898433 1898434 1898435 1898436 1898437 1898438 Multiply-claimed block(s) in inode 336904: 1361922 1361923 1361924 1361925 1361926 1361927 1361929 1361930 1361931 1361932 1361933 1361934 1361935 1361947 1361948 1361949 1361950 1361979 1361980 Pass 1C: Scanning directories for inodes with multiply-claimed blocks Pass 1D: Reconciling multiply-claimed blocks (There are 14 inodes containing multiply-claimed blocks.) ...... ........ Pass 2: Checking directory structure Entry '#336913' in /lost+found (11) has deleted/unused inode 336913. Clear<y>? yes Entry '#336914' in /lost+found (11) has deleted/unused inode 336914. Clear<y>? yes Entry '#336915' in /lost+found (11) has deleted/unused inode 336915. Clear<y>? yes Entry '#336916' in /lost+found (11) has deleted/unused inode 336916. Clear<y>? yes Entry '#336917' in /lost+found (11) has deleted/unused inode 336917. Clear<y>? yes Entry '#336918' in /lost+found (11) has deleted/unused inode 336918. Clear<y>? yes Entry '#336919' in /lost+found (11) has deleted/unused inode 336919. Clear<y>? yes Entry '#336920' in /lost+found (11) has deleted/unused inode 336920. Clear<y>? yes Pass 3: Checking directory connectivity Pass 4: Checking reference counts Inode 336558 ref count is 12, should be 11. Fix<y>? yes Unattached inode 336898 Connect to /lost+found<y>? yes Inode 336898 ref count is 2, should be 1. Fix<y>? yes Unattached inode 336899 Connect to /lost+found<y>? yes Inode 336899 ref count is 2, should be 1. Fix<y>? yes Unattached inode 336902 Connect to /lost+found<y>? yes Inode 336902 ref count is 2, should be 1. Fix<y>? yes Unattached inode 336904 Connect to /lost+found<y>? yes Inode 336904 ref count is 2, should be 1. Fix<y>? yes Pass 5: Checking group summary information Block bitmap differences: -(1344019--1344024) -(1347586--1347589) -1347594 -1347596 -(1347598--1347599) -(1349632--1349638) -1353729 -(1415305--1415308) -(1415329--1415332) -(1415338--1415343) -(1902878--1902967) Fix<y>? yes Free blocks count wrong for group #0 (31076, counted=30947). Fix<y>? yes Free blocks count wrong for group #41 (19489, counted=19511). Fix<y>? yes Free blocks count wrong for group #43 (1, counted=15). Fix<y>? yes Free blocks count wrong for group #58 (13532, counted=13622). Fix<y>? yes Free blocks count wrong (2373548, counted=2373545). Fix<y>? yes Inode bitmap differences: -336893 -(336913--336920) Fix<y>? yes Free inodes count wrong for group #41 (7031, counted=7040). Fix<y>? yes Directories count wrong for group #41 (254, counted=253). Fix<y>? yes Free inodes count wrong (1302827, counted=1302836). Fix<y>? yes /dev/mapper/eeepc-suvai: ***** FILE SYSTEM WAS MODIFIED ***** /dev/mapper/eeepc-suvai: 7884/1310720 files (3.9% non-contiguous), 2869335/5242880 blocks suvaipc:~# fsck.ext4 /dev/mapper/eeepc-suvai e2fsck 1.41.3 (12-Oct-2008) /dev/mapper/eeepc-suvai: clean, 7884/1310720 files, 2869335/5242880 blocks suvaipc:~# exit Script done on Mon Sep 28 11:17:59 2009 Hi, I'm trying to track down a problem relating to ext4 problems, in kernel bugzilla 14354. 
Originally it was thought this was a 2.6.31->2.6.32-rc1 regression, but Aneesh Kumar's report shows a very similar fsck transcript using a 2.6.30-1 kernel (which makes it unclear whether this is really a regression or not), and all three reports are tied together by the use of device-mapper.

The fsck logs are consistent with barriers being disabled. As I recall, under some circumstances device mapper silently drops barrier requests --- I thought this was fixed for device mapper in some cases. What's the current status of barrier requests in device mapper? Are they faithfully passed for standard (non-RAID) LVM2 volumes? When was this changed? What about dm-crypt volumes? (One of the reports was with a dm-crypt volume from what I can tell.)

Thanks,

- Ted
dmesg host=zwerg 2.6.32-rc2
The PC I used to reproduce this issue uses only LVM. It has run most versions between 2.6.31 and 2.6.32-rc3; mostly I am testing Intel graphics issues on it.
Created attachment 23377 [details]
dmesg host=zwerg 2.6.31
Created attachment 23378 [details]
dmesg host=mini v2.6.31
This is my laptop. I do not use any "dm" here (no raid, lvm or crypt), and here I had the same issue. I used most versions between 2.6.31 and 2.6.31-rc2 (bisection). After the reinstall there is only the 2.6.31 kernel.
Alexey, the dmesg outputs you are posting --- are these random dmesg's, or ones taken right before the system goes read-only, or the boot session *before* that?

If you can easily trigger the file system corruption, something really useful to do would be to set up syslog to do remote logging, so we can get the EXT4 syslog just as or after the root filesystem gets remounted read-only. In addition, it would be useful to get the EXT4 syslog messages for one or two boot sessions *before* the file system was found to be corrupted, as well as the boot session where the file system was found corrupted. If the file system is getting corrupted after the journal is found corrupted due to a checksum mismatch, the syslog file may not have the EXT4 syslog messages, since the file system may end up getting remounted read-only. And in that case, it's useful to see if there are any EXT4 syslog messages or any messages from the device driver showing any I/O errors.

The reason why I'm asking about LVM errors and dm-crypt is that at least one fsck log had the name /dev/mapper/sda5_crypt. At this point I'm still trying to find common elements and information leading to a reliable reproduction --- and something that can confirm that this is indeed a regression, so I know whether or not focusing on recent commits is productive. (The theory that it might have something to do with journal checksums assumes that this is a regression, but so far none of the dmesg output is showing the "Journal transaction XXXX is corrupt" message that would be expected if the theory that this was caused by a journal checksum failure is correct. So I don't want to focus too soon on recent commits as potential causes of this problem until we know for certain it is a regression.)

On Mon, Oct 12 2009 at 2:15pm -0400, Theodore Ts'o <tytso@mit.edu> wrote:
> Hi, I'm trying to track down a problem relating to ext4 problems, in
> kernel bugzilla 14354.
> [snip]
> What's the current status of barrier requests in device mapper? Are they
> faithfully passed for standard (non-RAID) LVM2 volumes? When was this
> changed? What about dm-crypt volumes? (One of the reports was with a
> dm-crypt volume from what I can tell.)

Barrier infrastructure started to get added to the DM core in 2.6.30, see:
http://git.kernel.org/linus/af7e466a1ace

But barriers were not enabled for all DM targets (_except_ dm-multipath) until 2.6.31. So 2.6.31's dm-crypt does support barriers, see:
http://git.kernel.org/linus/647c7db14ef9

If the underlying device(s) support barriers, DM should faithfully pass them on (again except for dm-multipath). Also, requests with barriers that result in -EOPNOTSUPP are retried without the barrier, see:
http://git.kernel.org/linus/51aa32284958

Mike

We used to issue EOPNOTSUPP in response to barriers (so flushing ceased to be supported when it became barrier-based). 'Basic' barrier support was added first (2.6.30-rc2), as Mike says, by waiting for relevant I/O to complete.
Then this was extended (2.6.31-rc1) to send barriers to the underlying devices for most types of dm targets. To see which dm targets in a particular source tree forward barriers, run:

grep 'ti->num_flush_requests =' drivers/md/dm*c

(set to a non-zero value).

I am not sure that "waiting" will do what it needs to do. To do barriers properly, we need to control the volatile cache state of the target device, which requires waiting and then sending an explicit cache flush. It sounds like barriers should be working properly in 2.6.31, though. If there is any sort of eventual reproducer, we could test the barrier theory by disabling write caches on the constituent drive(s) with hdparm.

-Eric

@Theodore, I can reproduce the data-loss bug, not the read-only bug; it happens mostly with 2.6.32-rcX. I had some problems with a read-only root, but that seems to be a different bug in Karmic's upstart. See: https://bugs.launchpad.net/ubuntu/+source/upstart/+bug/434358

In my data-loss issue, if something goes wrong and the kernel oopses or freezes (so that I need to power off the system), it seems like the _possibility_ of losing something is much bigger if I use a 2.6.32-rcX kernel. For example, I got corrupted files which had been written one week before the system was really broken.

Alexey, when you say corrupted files, what do you mean? Are the files missing? Full of 0s? Random data in part or all of the file? As for data loss, -some- data loss on a power fail is not unexpected; buffered data which has not been synced -will- be lost, so we need to be careful about how we use these terms, I think ...

-Eric

Alexey, in addition to Eric's very good comments, the other thing that's worth checking is to do a "sudo touch /forcefsck" (which will force an fsck on the next reboot) before shutting down your system -- or do it early in your login session, before you do something dangerous that might require a forced reboot/power cycle of your system. We need to distinguish between a file which was being replaced at the time of the power failure (some desktop applications/toolkits are very update-happy in terms of constantly rewriting application dot files) versus the file system actually becoming corrupt.

Created attachment 23398 [details]
2: reboot + mounting ro after journal replay
~2.6.32-rc2 crashed (due to a bug in the intel driver). I rebooted into 2.6.31, but shortly after the journal replay, and after an fsck that said everything was fine, the fs got mounted read-only.
Created attachment 23400 [details]
2: fsck on issue detected in attachment 23398 [details]

This is the fsck log. Again independent files use the same blocks, in this case the .o and other files created during the kernel compile. I will try to prepare my netbook to work on a reproducible test case.

Hmm.... what were you doing right before the crash? It looks like you were doing a kernel compile in /home/ich/source/linux/linux-2.6, since there were files with a modtime of Tue Oct 13 16:25:55 2009. What's funny is that when these files were allocated, they used blocks that were apparently already in use by other object files in that same source directory with a mod-time of Sat Oct 10 13:51:14 2009. Did you do a "make clean" at any time between Saturday and Tuesday that should have deleted these files?

If so, what I would strongly recommend is to run e2fsck -f on /dev/mapper/sda5_crypt before you mount it, each time. What seems to be happening is that the block allocation bitmap is getting corrupted somehow. This is what causes the multiply-claimed blocks. I'm guessing the file system had gotten corrupted already, before this last boot session. The trick is catching things *before* the filesystem is so badly corrupted that it gets remounted read-only and fsck has a large number of multiply-claimed inodes to clean up. This is important for two reasons: (a) it helps us localize when the initial file system corruption is taking place, which helps us find a reproduction case, and (b) it reduces the chances that you'll lose data.

So the problem is that /dev/mapper/sda5_crypt is your root filesystem, and it's using dm-crypt. *Sigh*, this actually makes life difficult, since last I checked we can't do LVM snapshots of dm-crypt devices. So that means you can't use the e2croncheck script, which is what I would recommend. (What I'm actually doing right now is, after every crash, rebooting, logging in, and running e2croncheck right after I log in. This allows me to notice any potential file system corruption before it gets nasty --- the problem is I'm not noticing any problem.) E2croncheck is much more convenient, since I can be doing other things while the e2fsck is running in one terminal window. But I suspect dm-crypt is going to make this impossible.

One thing you could do is "tune2fs -c 1 /dev/mapper/sda5_crypt". This will force a full check after every single reboot. This will slow down your reboot (fortunately ext4 is faster at fsck times, but unfortunately sda5 appears to be a 211 GB filesystem, and it appears to be converted from an old ext3 filesystem, so you won't see the full 10x speedup in fsck times that you would if this was a created-from-scratch ext4 file system), but if you do this while trying to find the problem, it would be very helpful.

As I said, I'm still trying to reproduce the problem on my end, but it's been hard for me to find a reproduction case.
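Both options mentioned in this thread can be set up quickly; a small sketch using Holger's device name (adjust to your own setup):

# force a full e2fsck on every mount of this filesystem
tune2fs -c 1 /dev/mapper/sda5_crypt

# verify the new maximum mount count
tune2fs -l /dev/mapper/sda5_crypt | grep -i 'mount count'

# alternatively, on Debian/Ubuntu init scripts, request a one-time check at the next boot
touch /forcefsck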
(In reply to comment #43)
> Hmm.... what were you doing right before the crash?
> [snip]
> Did you do a "make clean" at any time between
> Saturday and Tuesday that should have deleted these files?

Not a make clean but a "rm -rf *; git checkout ." after the previous corruption in the source/linux/linux-2.6 directory. Is it feasible to come up with a self-test patch for ext4 that compares the on-disk block allocation bitmap with the bitmap in memory? I will do an fsck on every boot now and will set up a test system in the evening to try to reproduce this (without dm/lvm in between). My idea for reproducing this would be:

$ rm -rf *
$ git checkout .
$ make

and then power cycle/reset during the make. I will report later on this.

One other question (and I'm not familiar with the extX code at all): if the hypothesis is that the allocation bitmap is corrupted, where could this happen?
- The in-memory bitmap not being written out, the journal not claiming these blocks, and the blocks being reassigned during runtime? This would be a journal bug?
- Block allocation at runtime forgetting to update the bit, and block allocation assigning the same block again? This would sound like a missing lock somewhere?
- Delayed allocation picking the same block?

Oh... after all your comments things are making more sense. Now I see the corruption probably happened before I found it. I'll assume it was after the last fsck, so about two or three weeks ago. First of all I need to say that my primary testing targets are intel graphics, acpi and ath9k (wifi); other bugs are mostly skipped. Like I said, this bug was mixed up with another Ubuntu Karmic bug, which made it more complicated to find out when this bug was really introduced. The Karmic bug was introduced around alpha 2 - it will fail to boot on fsck. Normally I did "fsck /root" or "mount -o remount,ro && fsck".

With kernel 2.6.32-rc1 I needed to bisect some intel graphics issue, so I tested most versions between 2.6.32-rc1 and 2.6.31. Somewhere shortly after 2.6.32-rc1 I got an oops in ext4, so I skipped testing until shortly after 2.6.32-rc2 (this is probably when the real corruption happened). With ~2.6.32-rc3 I got an oops in ~scsi when I tried to fdisk a USB stick. Since I needed to make this USB stick work, I rebooted (not reset) and started with the "stable" 2.6.31-ubuntu-13. This time I got the same problem - the PC won't boot and needs an fsck (it looks like the Karmic fsck issue but it isn't ... the message "bla bla /forcefsck" was new for me), and that didn't work. remount,ro didn't work either, because /etc/mtab was corrupted; cat /etc/mtab showed only random trash in it. I started 2.6.31-ubuntu-13 into the initrd and mounted root manually read-only; running fsck -f -y /root returned a lot of messages like "file1 and file2 use the same sector" and "removing file3" bla bla... lost+found... After a reboot the system worked again. To find out which files were lost I ran debsums; it showed that some packages which had been installed one week before were broken. After I reinstalled the broken packages the system seems to work stably. Now that I am "systematically" trying to analyse this, I have noticed that all my broken laptops ran at most 2.6.32-rc1 before dying. Probably my main testing PC got corrupted around 2.6.31-rc1 time too. Now I use the e2croncheck script in cron.daily and will see if I can confirm it.

Are you using the default upstream journal mode - writeback (not ordered)?

Ric, the default mode is ordered for ext4; the default mode for ext3 is writeback. The reason for not making the change in ext4 is that with delayed allocation, we don't have the stalling problems that ext3 had.

I get the same issue on 2.6.32-rc4. I worked on the uvc (webcam) module, so I installed the same module over and over. At some point I accidentally unplugged the webcam while it was being used by cheese. As a result I got a kernel oops ... a restart didn't work, so I reset the machine.
Blame me, I forgot to save the oops... I now have only the dmesg after the restart, the fsck output, and a dd image of the root taken before the dmesg.

[ 5.573939] EXT4-fs (dm-0): INFO: recovery required on readonly filesystem
[ 5.573942] EXT4-fs (dm-0): write access will be enabled during recovery
[ 5.628089] EXT4-fs (dm-0): orphan cleanup on readonly fs
[ 5.640860] EXT4-fs warning (device dm-0): ext4_orphan_get: bad orphan inode 133701! e2fsck was run?
[ 5.640863] ext4_test_bit(bit=2628, block=524304) = -1
[ 5.640864] inode=ffff880138764840
[ 5.640865] is_bad_inode(inode)=0
[ 5.640866] NEXT_ORPHAN(inode)=1255554482
[ 5.640867] max_ino=1220608
[ 5.640867] i_nlink=0
[ 5.640895] EXT4-fs (dm-0): recovery complete
[ 5.807823] EXT4-fs (dm-0): mounted filesystem with ordered data mode
....
[ 10.673483] EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 394941
[ 10.673626] EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 394941
[ 10.673699] EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 394941
...

fsck has lots of:

File /lib/modules/2.6.32-rc4-00094-g80f5069-dirty/kernel/arch/x86/kernel/cpu/cpu_debug.ko (inode #393904, mod time Wed Oct 14 22:47:35 2009)
  has 57 multiply-claimed block(s), shared with 2 file(s):
  ... (inode #394934, mod time Wed Oct 14 21:53:34 2009)
  /lib/modules/2.6.32-rc4-00094-g80f5069-dirty/modules.symbols.bin (inode #394931, mod time Wed Oct 14 21:53:33 2009)
Clone multiply-claimed blocks? yes
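Since Alexey mentions having taken a dd image of the root device, that image can be examined offline without risking the original; a rough sketch (the device name follows his logs, the output path is just an example):

# capture an image of the root device while it is unmounted or mounted read-only
dd if=/dev/mapper/zwerg_buntu-root_one of=/mnt/usb/root.img bs=4M

# read-only check of the image; -n answers "no" to every proposed fix
e2fsck -fn /mnt/usb/root.img

# inspect one of the inodes named in the fsck output, e.g. inode 393904
debugfs -R 'stat <393904>' /mnt/usb/root.img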
dmesg 2.6.32-rc4 host=zwerg
Created attachment 23412 [details]
fsck 2.6.32-rc4 host=zwerg
BTW, I've been doing additional tests where I'm doing a kernel or e2fsprogs build on my laptop, and then force a power down/crash while the build is running, and I still haven't been able to reproduce the bug on my end. Can someone give me a hint about what else might be happening on the system when the crash happens? Thanks!!

Don't worry, you'll get it. Just take a silver bullet and shoot the beast:

echo c > /proc/sysrq-trigger

This killed my root really hard. I am still trying to reanimate it.

A few times recently I had corruption very similar to what's reported here, with Ext4 on an ordinary disk partition with no DM or anything fancy. Personally I'm suspicious of the new boot scripts in Karmic. In trying to be as concurrent as possible, maybe it's doing something really bad like running twice and starting two concurrent instances of fsck on the same partition, or running fsck while the partition is already mounted with write access. I can confirm an issue where it doesn't reboot after directing you to manually run a disk check and instead tries to remount the root partition as rw, reported here: https://bugs.launchpad.net/ubuntu/+source/mountall/+bug/452196 But that doesn't explain the disk getting corrupted in the first place.

hmm.. that's an interesting suggestion. Is anyone who is reporting problems _not_ using Karmic userspace? I'm using Ubuntu 9.04.... maybe that's why I can't reproduce it?

I encountered the same on Debian Sid, which still uses SysV init. I also use a plain partition, no DM or LVM.

On Tue, 13 Oct 2009, Mike Snitzer wrote:
> On Mon, Oct 12 2009 at 2:15pm -0400,
> Theodore Ts'o <tytso@mit.edu> wrote:
>
> > Hi, I'm trying to track down a problem relating to ext4 problems, in
> > kernel bugzilla 14354. Originally it was thought this was a
> > 2.6.31->2.6.32-rc1 regression, but Aneesh Kumar's report shows a
> > very similar fsck transcript using a 2.6.30-1 kernel (which makes
> > it unclear whether this is really a regression or not), and all three
> > reports are tied together by the use of device-mapper.
> >
> > The fsck logs are consistent with barriers being disabled. As I recall, under
> > some circumstances device mapper silently drops barrier requests --- I
> > thought this was fixed for device mapper in some cases. What's the
> > current status of barrier requests in device mapper? Are they
> > faithfully passed for standard (non-RAID) LVM2 volumes? When was this
> > changed? What about dm-crypt volumes? (One of the reports was with a
> > dm-crypt volume from what I can tell.)
>
> Barrier infrastructure started to get added to the DM core in 2.6.30, see:
> http://git.kernel.org/linus/af7e466a1ace
>
> But barriers were not enabled for all DM targets (_except_ dm-multipath)
> until 2.6.31. So 2.6.31's dm-crypt does support barriers, see:
> http://git.kernel.org/linus/647c7db14ef9
>
> If the underlying device(s) support barriers DM should faithfully pass
> them on (again except for dm-multipath). Also, requests with barriers
> that result in -EOPNOTSUPP are retried without the barrier, see:
> http://git.kernel.org/linus/51aa32284958
>
> Mike
Hi
Device mapper never drops data in the barrier request.
It converts a barrier into a non-barrier request, or drops a zero-sized barrier, if
the underlying device doesn't support barriers. Device mapper doesn't
support barriers for the dm-raid1 target (pending another upstream patch
that was ignored).
If the underlying device doesn't support barriers or if dm-raid1 target is
being used, device mapper returns success on barriers and it simply waits
for the request queue to drain when accepting a barrier.
The problems with "not flushing disk cache" only show up if you pull
the power plug out --- in this case, make sure that if you use dm-raid1,
disk cache is turned off (hdparm -W 0). With other dm-targets, you can
leave the disk cache on, because they accept barriers.
If you see errors under normal operations, without power loss, then it has
probably nothing to do with barriers at all.
Mikulas
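If a reproducer does eventually turn up, Eric's earlier suggestion of taking the drive's volatile write cache out of the picture is easy to try; a minimal sketch, assuming the disk in question is /dev/sda:

# show the current write-cache setting
hdparm -W /dev/sda

# turn the write cache off for the test run, then try to reproduce
hdparm -W 0 /dev/sda

# turn it back on afterwards
hdparm -W 1 /dev/sda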
Alexey, were you doing anything in particular when you did the "echo c > /proc/sysrq-trigger"? Does it reproduce for you every single time?

I did "echo c > /proc/sysrq-trigger" while running a "make -j4" in an e2fsprogs source tree, and after the reboot, both the root logical volume (where I have a ccache directory on an SSD drive) and my build logical volume (where I have an e2fsprogs source and build tree) were just fine. As I said, I'm having a devil of a time reproducing this. I've tried power failures (removing the battery and pulling the power cord) while the build was taking place, suspending the system and then forcing a cold restart, suspending the system and then pulling the power, using "echo c > /proc/sysrq-trigger", etc. After each time, I've used e2croncheck to verify the sanity of my SSD root partition and (in case the problem is the ext4 journal replay) mounted and then unmounted /kbuild and run e2fsck -f /dev/closure/kbuild. In all cases, I've not had any file system inconsistencies (using 2.6.32-rc1, 2.6.32-rc3, and most recently, 2.6.32-rc4). I really need some help reproducing this!

On Fri, Oct 16, 2009 at 12:28:18AM -0400, Parag Warudkar wrote:
> So I have been experimenting with various root file systems on my
> laptop running latest git. This laptop some times has problems waking
> up from sleep and that results in it needing a hard reset and
> subsequently unclean file system.

A number of people have reported this, and there is some discussion and some suggestions that I've made here:

http://bugzilla.kernel.org/show_bug.cgi?id=14354

It's been very frustrating because I have not been able to replicate it myself; I've been very much looking for someone who is (a) willing to work with me on this, and perhaps willing to risk running fsck frequently, perhaps after every single unclean shutdown, and (b) who can reliably reproduce this problem. On my system, which is a T400 running 9.04 with the latest git kernels, I've not been able to reproduce it, despite many efforts to try (i.e., suspending the machine and then pulling the battery and power; pulling the battery and power or "echo c > /proc/sysrq-trigger", etc., while doing "make -j4" when the system is being uncleanly shut down).

So if you can come up with a reliable reproduction case, and don't mind doing some experiments and/or exchanging debugging correspondence with me, please let me know. I'd **really** appreciate the help.

Information that would be helpful to me would be:

a) Detailed hardware information (what type of disk/SSD, what type of laptop, hardware configuration, etc.)

b) Detailed software information (what version of the kernel are you using including any special patches, what distro and version are you using, are you using LVM or dm-crypt, what partition or partitions did you have mounted, was the failing partition a root partition or some other mounted partition, etc.)

c) Detailed reproduction recipe (what programs were you running before the crash/failed suspend/resume, etc.)

If you do decide to go hunting this problem, one thing I would strongly suggest is either to use "tune2fs -c 1 /dev/XXX" to force an fsck after every reboot, or, if you are using LVM, to use the e2croncheck script (found as an attachment in the above bugzilla entry or in the e2fsprogs sources in the contrib directory) to take a snapshot and then check the snapshot right after you reboot and log in to your system.
The reported file system corruptions seem to involve the block allocation bitmaps getting corrupted, so you will significantly reduce the chances of data loss if you run e2fsck as soon as possible after the file system corruption happens. This helps you not lose data, and it also helps us find the bug, since it helps pinpoint the earliest possible point where the file system is getting corrupted.

(I suspect that some bug reporters had their file system get corrupted one or more boot sessions earlier, and by the time the corruption was painfully obvious, they had lost data. Mercifully, running fsck frequently is much less painful on a freshly created ext4 filesystem, and of course if you are using an SSD.)

If you can reliably reproduce the problem, it would be great to get a bisection, or at least a confirmation that the problem doesn't exist on 2.6.31 but does exist on 2.6.32-rcX kernels. At this point I'm reasonably sure it's a post-2.6.31 regression, but it would be good to get a hard confirmation of that fact.

For people with a reliable reproduction case, one possible experiment can be found here:

http://bugzilla.kernel.org/show_bug.cgi?id=14354#c18

Another thing you might try is reverting these commits one at a time, and see if they make the problem go away: d0646f7, 5534fb5, 7178057. These are the three commits that seem most likely, but there are only 93 ext4-related commits, so doing a "git bisect start v2.6.31 v2.6.32-rc5 -- fs/ext4 fs/jbd2" should only take at most seven compile tests --- assuming this is indeed a 2.6.31 regression and the problem is an ext4-specific code change, as opposed to some other recent change in the writeback code or some device driver which is interacting badly with ext4.

If that assumption isn't true, and so a git bisect limited to fs/ext4 and fs/jbd2 doesn't find a bad commit which when reverted makes the problem go away, we could try a full bisection search via "git bisect start v2.6.31 v2.6.31-rc3", which would take approximately 14 compile tests, but hopefully that wouldn't be necessary.

I'm going to be at the kernel summit in Tokyo next week, so my e-mail latency will be a bit longer than normal, which is one of the reasons why I've left a goodly list of potential experiments for people to try. If you can come up with a reliable regression, and are willing to work with me or to try some of the above-mentioned tests, I'll definitely buy you a real (or virtual) beer.

Given that a number of people have reported losing data as a result, it would **definitely** be a good thing to get this fixed before 2.6.32 is released.

Thanks,

- Ted

Seems like I can mostly reproduce this issue on my PC. The system seems to be normal on a reset after a fresh start. Usually I lose files on the root partition, and sometimes evolution settings on home; it seems like evolution is doing something "wrong". My best reproduction case is a reinstall of kernel modules (sudo make modules_install). The worst thing that happened: all modules for that kernel version were lost. To reproduce, I did "make modules_install" three times with random intervals between installs (10 min - 1 h). I also use e2croncheck daily, and sometimes I run it manually - I never get any errors from this script. Today I will test vanilla 2.6.31 and probably begin to bisect.

Alexey

On Fri, Oct 16, 2009 at 05:15:58AM -0400, Theodore Tso wrote:
> These are three commits that seem most likely, but there are
> only 93 ext4-related commits, so doing a "git bisect start v2.6.31
> v2.6.32-rc5 -- fs/ext4 fs/jbd2"
One correction. The git bisect start usage is:
git bisect start [<bad> [<good>...]] [--] [<paths>...]
So the correct git bisect start command should be:
git bisect start v2.6.32-rc5 v2.6.31 -- fs/ext4 fs/jbd2
And similarly, "git bisect start v2.6.31 v2.6.31-rc3" should have been:
git bisect start v2.6.31-rc3 v2.6.31
My apologies for any confusion.
- Ted
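For anyone attempting the bisection, the per-step loop would look roughly like this, assuming each candidate kernel is tested by provoking a crash and force-fscking as described earlier in this thread:

git bisect start v2.6.32-rc5 v2.6.31 -- fs/ext4 fs/jbd2

# for each kernel git bisect checks out:
make -j4 && make modules_install && make install
# ... reboot into it, provoke the crash, reboot, run e2fsck -f ...

# then record the result so git picks the next commit to test:
git bisect good     # filesystem came back clean
git bisect bad      # filesystem was corrupted

# when done:
git bisect reset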
Alexey, if reinstalling kernel modules is your best reproduction case, are you seeing actual file system corruption, or simply zero-length files? And are you using the "sync" command after re-installing the kernel modules?

If it is actual file system corruption, try mounting the root filesystem with "auto_da_alloc=0"; that is, add the boot command-line option "rootflags=auto_da_alloc=0". This will probably increase the likelihood of zero-length files if you don't use "sync" after reinstalling the kernel modules --- this is to be expected, since we are disabling the heuristic used to try to avoid zero-length files which aren't synced out to disk before an unclean shutdown. The goal of this experiment is to see if maybe there's something about this heuristic which is screwing up the ability to recover the filesystem free of corruption after an unclean shutdown.

First result with v2.6.31: I made some scripts to make it easier for me to kill my PC. Now I log fsck, dmesg and md5sums every time after I kill it. 2.6.31 acts worse in this situation than 2.6.32 does. On the read-only mount the kernel tries to recover the fs (... 6 orphan inodes deleted ... recovery complete), and fsck -y -f returns no warnings (only "FILE SYSTEM WAS MODIFIED"). After the reboot there are many _corrupt_ modules; the files exist but do not match (md5sum) the original files... diff <(xxd file_corrupt) <(xxd file_orig) shows about 30% difference. I expected fsck to remove the corrupted files or warn me about this. I am not using sync in my script; it would be really nice if the kernel did a sync before the PC dies. I will now test v2.6.32-rc4 again with md5sums and see the difference. After this I will try auto_da_alloc=0.

Reply-To: rwheeler@redhat.com

On 10/16/2009 05:15 AM, Theodore Tso wrote:
> It's been very frustrating because I have not been able to replicate
> it myself;
> [snip]
> (i.e., suspend the machine and then pull the battery and power; pulling the
> battery and power, "echo c > /proc/sysrq-trigger", etc., while
> doing "make -j4" when the system is being uncleanly shutdown)

I wonder if we might have better luck if we tested using an external (e-sata or USB connected) S-ATA drive. Instead of pulling the drive's data connection, most of these have an external power source that could be turned off so the drive firmware won't have a chance to flush the volatile write cache. Note that some drives automatically write back the cache if they have power and see a bus disconnect, so hot unplugging just the e-sata or usb cable does not do the trick. Given the number of cheap external drives, this should be easy to test at home....
Ric
I forgot to post my hardware:

hdparm -i /dev/sda

 Model=WDC, FwRev=01.01A01, SerialNo=WD-WCAU45528406
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=50
 BuffType=unknown, BuffSize=32767kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953525168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes: pio0 pio3 pio4
 DMA modes: mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: Unspecified: ATA/ATAPI-1,2,3,4,5,6,7

Board: intel ID45DG (45G + ICH10R), 4G RAM, Core2 Duo CPU E8400.

Do not trust dmesg or fsck.log, use md5sum! OK, now test results based on md5sums:

2.6.32-rc4:                   16 corrupt (parts of files), 120 not found (!!??)
2.6.31:                       7 corrupt
2.6.32-rc4 + auto_da_alloc=0: nothing lost :)

The same script was used for all tests. Should I do a bisect?

Reply-To: parag.lkml@gmail.com

On Fri, Oct 16, 2009 at 5:15 AM, Theodore Tso <tytso@mit.edu> wrote:
> A number of people have reported this, and there is some discussion
> and some suggestions that I've made here:
>
>     http://bugzilla.kernel.org/show_bug.cgi?id=14354

Ok, I went through this bug report and here are some immediately useful things to note -

1) I am not running Karmic - I am running a clean Jaunty x86_64 install
2) I am not using dm/lvm or anything fancy for that matter
3) So far, I have been able to reproduce this problem by just hitting the power button on my laptop when it is doing nothing. It also happens when waking up from s2ram when the laptop wasn't doing anything when it was suspended (I mean I wasn't copying/deleting stuff, wasn't running make - laptop was

> So if you can come up with a reliable reproduction case, and don't
> mind doing some experiments and/or exchanging debugging correspondence
> with me, please let me know. I'd **really** appreciate the help.

My laptop is a reliable test case - for me at least! I tried just now to abruptly reset the laptop, and upon reboot there was an fsck followed by another reboot, only to have X fail to start and NetworkManager segfault. At this point I am pretty sure I can reproduce it just by power cycling the laptop using the power button. After another fsck and a reboot it finally comes up.

> Information that would be helpful to me would be:
>
> a) Detailed hardware information (what type of disk/SSD, what type of
> laptop, hardware configuration, etc.)

The SSD is a Corsair P256 with the latest firmware, used on another machine without any issues. The laptop is an HP EliteBook 8530p, 4GB RAM, Intel T9400 CPU, Intel WiFi, ATI 3650 GPU. No proprietary drivers ever loaded.
> > b) Detailed software information (what version of the kernel are you > using including any special patches, what distro and version are you > using, are you using LVM or dm-crypt, what partition or partitions did > you have mounted, was the failing partition a root partition or some > other mounted partition, etc.) It happens on current custom compiled git - both on custom and minimal localmodconfig. It also happens on Ubuntu daily kernel PPA builds. > > c) Detailed reproduction recipe (what programs were you running before > the crash/failed suspend/resume, etc.) > Really nothing special - I boot to desktop, may be open FireFox for few minutes and then try reset. fsck then reports a bunch of errors and forces reboot. On reboot X fails to start, file system although mounted rw cannot be written to - vim for instance won't open any file due to write errors. Another fsck finds few more problems (or sometimes not) and reboot brings it back to desktop. So my problem is not corruption really but the amount and nature of errors fsck encounters and corrects on unclean shutdown and the write failures until another fsck -f finds more problems and reboots. None of this happens on any other filesystem including the /boot ext3 fs on the same disk. > > If you do decide to go hunting this problem, one thing I would > strongly suggest is that either to use "tune2fs -c 1 /dev/XXX" to > force a fsck after every reboot, or if you are using LVM, to use the > e2croncheck script (found as an attachment in the above bugzilla entry > or in the e2fsprogs sources in the contrib directory) to take a > snapshot and then check the snapshot right after you reboot and login > to your system. Â The reported file system corruptions seem to involve > the block allocation bitmaps getting corrupted, and so you will > significantly reduce the chances of data loss if you run e2fsck as > soon as possible after the file system corruption happens. Â This helps > you not lose data, and it also helps us find the bug, since it helps > pinpoint the earliest possible point where the file system is getting > corrupted. I have enabled fsck on every mount but I am not certain ongoing "clean state" corruption is the problem in my case. Things have worked well without any trouble if I don't end up doing a unclean shutdown. [ snip] > I'm going to be at the kernel summit in Tokyo next week, so my e-mail > latency will be a bit longer than normal, which is one of the reason > why I've left a goodly list of potential experiments for people to > try. Â If you can come up with a reliable regression, and are willing > to work with me or to try some of the above mentioned tests, I'll > definitely buy you a real (or virtual) beer. > I will try the things you mentioned - finding if this happens in pre .32 kernels is the first one on my list followed by reverting the specific commits you mentioned, followed if necessary by complete bisection. I am afraid however this is not a regression - at least not a recent one, as I have had this experience with ext4 and unclean shutdowns since long time. And that's on different hardware/different disks. Thanks, Parag I have an idea for anyone who gets this problem on a clean reboot. Do what it takes to reproduce the problem, then quit to single-user mode (sudo telinit 1) and try to remount root as read-only: mount -o ro,remount / If you can't do this because the filesystem is busy, look for and kill any process that might have a file open for writing. 
Remounting as read-only does not seem to be forcible, as the command appears to succeed, but you still have write access. So it's a good idea to 'touch /blah' to make sure you've turned off write access. Plan A. If you've successfully remounted as read-only, run 'e2fsck -f' on your filesystem. If you had no errors on the filesystem before, but you do now, you know the errors were introduced during the session. If there are no errors on this check and you never get the corruption on the next boot after doing this, it's probably a problem with data not getting flushed by the time the system reboots. Plan B. If you can't remount as read-only and can't find a process responsible for that to kill, run sync, then 'e2fsck -fn' on the partition and report the results. I just did a dist-upgrade on my desktop machine running Karmic and the standard Karmic 2.6.31 kernel, and I was going to do the above before rebooting, but had to resort to plan B because I couldn't remount / as read-only. Even after sync, e2fsck warned that it was skipping journal recovery, and it reported errors. Ted, is it to be expected that e2fsck would find a journal recovery necessary and see errors even if the filesystem was still mounted rw, but had been synced and no write access had taken place since the sync? I know no writing had taken place because I could run sync after the drive had spun down, and it wouldn't spin up again. Yet e2fsck continued to report the same result. I did 'reboot -f' after this to avoid any script shenanigans and used 'init=/bin/bash' on the next boot so I could manually examine the situation. When my root fs was been mounted read-only, it replayed the journal, then there were no errors on 'e2fsck -f'. I have no idea if that is useful for anyone. I have done the following. 0.) use dd to create a big file 1.) install a OpenSuSE 11.1 in it using kvm 2.) change the /etc/fstab and /sbin/init 2.1) /sbin/init is doing mount /proc fsck -f /dev/hda2 echo "press any key to continue" mount -oremount,rw / cd /source/linux-2.6 rm -rf * rm -f .git/index.lock git checkout . make -j 6 make -j 6 modules_install 3.) Use kvm to start with the disk and an external kernel '-kernel ./bzImage -append root=/dev/hda2" 4.) During the "make -j 6" I interrupt the kernel compile. Result every ~10th time the filesystem is left with some errors (where you have to Y/N for fsck) after the journal recovery. The errors are lost inodes (to be attached to lost+found), zero length inodes and such. The benefit of using the above setup: - non karmic userspace (kvm still running on karmic though) - no dm (disk image still hosted on a dm device) - can copy/clone the disk image for each kernel run and use debugfs to look at it. I will post the next fsck log with errors and then I can upload it somewhere if it is of any interest. you don't have to edit init script, all you need is to boot with "break=mount". On ubuntu/debian this will stop boot in initrd without mounted root. After this you can mount it read only or what ever you need. But you can't fsck it, initrd do not include such tools. Continue: mount -o ro /dev/sda1_your_root /root mount -o bind /dev /root/dev chroot /root /bin/bash now you'll have clean mount without any other corruptions caused by init script bug. Please can anyone place this "auto_da_alloc=0". On my system this work around data lost issue. Can any one confirm this too? /place "auto_da_alloc=0 to fsck/ oh... man... 
s/fsck/fstab/g cat fstab: /dev/sda1 / ext4 auto_da_alloc=0,errors=remount-ro 0 1 Created attachment 23442 [details]
kernel log
dmesg when I noticed corruption
Created attachment 23443 [details]
fsck log
Happened on my desktop system. It has ubuntu 9.04, ext4, and intel drivers. I was troubleshooting a GPU hang that happened after I switch modes in an SDL game. Same pattern, few crashes, duplicated blocks, and now I set the fsck to run after each reboot. And I had a crash before this, and thus it very likely that indeed free block bitmap is corrupted. i tested 2.6.30, nothing in fsck.log or dmesg, md5sum found 6 corrupted files. I assume corruted files is not a bug :) and begin to bisect lost files between 2.6.31 and 2.6.32-rc4 Alexey, There is a very big difference between _files_ being corrupted and the _file_ _system_ being corrupted. Your test, as I understand it, is a "make modules_install" from a kernel source tree, followed immediately by a forced crash of the system, correct? Are you doing an "rm -rf /lib/modules/2.6.32-XXXX" first, or are you just doing a "make modules_install" and overwriting files. In any case, if you don't do a forced sync of the filesystem, some of the recently written files will be corrupted. (Specifically, they may be only partially written, or truncated to zero-length.) This is normal and to be expected. If you want to make sure files are written to stable storage, you *must* use sync or fsync(3). This is true for pretty much any file system, by the way. If you have a script that looks something like this #!/bin/sh rm -rf /lib/modules/`uname -r` make modules_install echo c > /proc/sysrq-trigger you _will_ end up with some files being missing, or not fully written out. Try it with ext3, xfs, btrfs, reseirfs. All Unix filesystems have some amount of asynchronous writes, because otherwise performance would suck donkey gonads. You can try to mount with -o sync, just to see how horrible things would be. So what do you do if you have a "precious" file --- a file where you want to update its contents, but you want to make absolutely sure either the old file or the new file's contents will still be present? Well, you have to use fsync(). Well-written text editors and things like mail transfer angents tend to get this right. Here's one right way of doing it: 1) fd = open("foobar.new", O_CREAT|O_TRUNC, mode_of_foobar); 2) /* copy acl's, extended attributes from foobar to foobar.new */ 3) write(fd, buf, bufsize); /* Write the new contents of foobar */ 4) fsync(fd); 5) close(fd); 6) rename("foobar.new", "foobar"); The basic idea is you write the new file, then you use fsync() to guarantee that the contents have been written to disk, and then finally you rename the old file on top of the old one. As it turns out, for a long time Linux systems were drop dead reliable. Unfortunately, recently with the advent of ACPI suspend/resume, which assumed that BIOS authors were competent and would test on OS's other than windows, and proprietry video drivers that tend to be super unreliable, Linux systems have started crashing more often. Worse yet, application writers are started getting sloppy, and would write code sequences like this when they want to update files: 1) fd = open("foobar", O_CREAT|O_TRUNCATE, default_mode); 2) write(fd, buf, bufsize); /* write the new contents of foobar */ 3) close(fd); Or this: 1) fd = open("foobar.new", O_CREAT|O_TRUNC, mode_of_foobar); 2) write(fd, buf, bufsize); /* Write the new contents of foobar */ 3) close(fd); 4) rename("foobar.new", "foobar"); I call the first "update-via-truncate" and the second "update-via-replace". 
Because with delayed allocation, files have a tendency to become zero-length if you update them without using fsync() and than an errant ACPI bios or buggy video driver takes your system down --- and because KDE was updating many more dot files than necessary, and firefox was writing half a megabyte of disk files for every single web click, people really started to notice problems. As a result, we have hueristics that detect update-via-rename and update-via-truncate, and if we detect this write pattern, we force a background writeback of that file. It's not a synchronous writeback, since that would destroy performance, but a very small amount of time after a close(2)'ing a file descript that was opened with O_TRUNCATE or which had been explicitly truncated down to zero using ftruncate(2) -- i.e., update-via-truncate --- , or after a rename(2) which causes an inode to be unlinked --- i.e., uodate-via-unlink --- the contents of that file will be written to disk. This is what auto_da_alloc=0 inhibits. So why is it that you apparently had no data loss when you used auto_da_alloc=0? I'm guessing because the file system activity entire script fit within a single jbd2 transaction, and the transaction never committed before the script forced a system crash. (Normally a transaction will contain five seconds of filesystem activity, unless (a) a program calls fsync(), or (b) there's been enough file system activity that a significant chunk of the journal space has been confused. One of the changes between 2.6.31 and 2.6.32-rc1 was a bugfix that fixed a problem in 2.6.31 where update-via-truncate wasn't getting detected. This got fixed in 2.6.32-rc1, and that does change when data gets forced out to disk. But in any case, if it's just a matter of the file contents not getting written to disk, that's expected if you don't use fsync() and you crash immediately afterwards. As I said earlier, all file systems will tend to lose data if you crash without first using fsync(). The bug which I'm interested in replicating is one where the actual _file_ _system_ is getting corrupted. But if it's just a matter of not using sync() or fsync() before a crash, that's not a bug. - Ted Thank you for explanation. While bisect fsck.log was for me like: i have broken files WTF fsck do not telling me anything. If you mean with "files system corruption": corrupted orphan, unattached inode, block bitmap difference...; yes this test clearly reproduce it. But only with md5sum i kan see how big are real damages. Here is the script i do: make modules_install echo "1v3 done" sleep 5m make modules_install echo "2v3 done" sleep 5m make modules_install echo "3v3 done do checksum" mount /dev/sda13 /mnt md5deep -r /lib/modules/$k_version/* > /mnt/md5sums umount /mnt sleep 1m echo c > /proc/sysrq-trigger (taking bug temporarily just so I can edit the attachment mime types...) (oh well, I guess even the assignee can't do that - setting text attachments as text/plain makes reading them a bit simpler...) I double checking it now, but is seems to be this patch: commit 91ac6f43317c0bf99969665f98016548011dfa38 Author: Frank Mayhar <fmayhar@google.com> Date: Wed Sep 9 22:33:47 2009 -0400 ext4: Make non-journal fsync work properly Teach ext4_write_inode() and ext4_do_update_inode() about non-journal mode: If we're not using a journal, ext4_write_inode() now calls ext4_do_update_inode() (after getting the iloc via ext4_get_inode_loc()) with a new "do_sync" parameter. 
If that parameter is nonzero _and_ we're not using a journal, ext4_do_update_inode() calls sync_dirty_buffer() instead of ext4_handle_dirty_metadata(). This problem was found in power-fail testing, checking the amount of loss of files and blocks after a power failure when using fsync() and when not using fsync(). It turned out that using fsync() was actually worse than not doing so, possibly because it increased the likelihood that the inodes would remain unflushed and would therefore be lost at the power failure. Signed-off-by: Frank Mayhar <fmayhar@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> i checked it one more time and i confirm my previous post. Alexey, how did you check it? What happens when you use build commit 91ac6f43? Does the output of fsck show file system corruption after running your script? And what happens if you build commit fe188c0e (aka commit 91ac6f43^ aka the parent or immediate predecessor of commit 91ac6f43)? Does the problem reliably go away, which is to say fsck no longer reports file system corruption after you use the immediate predecessor of commit 91ac6f32, but which is there when you use commit 91ac6f32? This commit is supposed to not change anything if you use it with a journal (which is the ext4 default). I've reviewed the patch again, and it looks like it should be changing anything for a default ext4 filesystem with a journal present --- which is why I'm asking you exactly how you confirmed that this commit is the problem one. Maybe I'm missing something, but I don't see how the presence or absence of this patch could be changing anything. Thanks for all of your work, even if it's causing me to be even more mystified about what is going on. Created attachment 23456 [details]
logs of all bisected kernels
Here are the logs of all tested kernels; the same script from my previous post was used for every test. Only 2.6.32-rc4-00094-g80f5069-dirty and 2.6.31 may be a bit different: for those kernels I still had the first version of the script.
After the crash I booted into the initrd with the break=mount option, so we can exclude any Karmic initscript bugs. After that I mounted the root fs read-only and chrooted into it. At this point I ran my checkfs script:
---------------------------------------------------------------------------
#!/bin/bash
# Collect post-crash evidence: fsck output, dmesg, and an md5 check of the
# modules installed before the crash.  Logs go to the scratch partition.
current_date=$( date +%d_%m_%Y-%H_%M );
k_version=$( uname -r );
mount /dev/sda13 /mnt   # scratch partition holding md5sums and the logs
cd /mnt
fsck -y -f /dev/mapper/zwerg_buntu-root_one > $k_version-$current_date.fsck
dmesg > $k_version-$current_date.dmesg
md5sum -c --quiet /mnt/md5sums > $k_version-$current_date.md5   # verify against checksums taken before the crash
-------------------------------------------------------------
I hope this will help you. Please check the logs in case there are any mistakes.
If you need more info, tell me soon. My system is badly broken now; I will need to reinstall it.
With commit fe188c0e I get a "clean" fsck log; with 91ac6f43 I get errors like:
Free blocks count wrong for group #0 (8935, counted=8922).
Entry 'libnss_wins.so.2' in /var/spool/postfix/lib (155666) has deleted/unused inode 136151. Clear? yes
Running additional passes to resolve blocks claimed by more than one inode...
Pass 1B: Rescanning for multiply-claimed blocks
Multiply-claimed block(s) in inode 131134: 557394
Multiply-claimed block(s) in inode 131201: 557394
... you'll see it. To be sure, I double-checked both of these kernels. (In reply to comment #84) Maybe this is a totally stupid suggestion, but is there any chance for this code path to be executed during journal replay, when s_journal is still NULL? I reverted this patch manually against v2.6.32-rc5. Now there is no fs corruption, and sometimes no file corruption either. The first part of the patch (ext4_do_update_inode) was already reverted; I reverted only the ext4_write_inode part. Created attachment 23458 [details]
skip the non-journal branch if readonly
Comment on attachment 23458 [details]
skip the non-journal branch if readonly
Can you try this patch on top and see if it makes any difference?
This patch is working too: no fs corruption, and no file corruption either. Created attachment 23468 [details] Testing patch; ro-diff with a WARN_ON(1) Could you try this patch on top of a stock 2.6.32-rc kernel? ext4_write_inode() should not be called before the journal is set up, and I'm not able to see how this could work. Yet if the ro-diff.patch in attachment #23458 [details] makes a difference, then that must be what's happening. This is a variant on that patch that executes a printk and a WARN_ON(1) when the conditional triggers; that will allow me to figure out how ext4_write_inode() is getting called early. I'm not able to replicate this on my system with this patch, but if it's working for you, then this should allow me to see what's actually happening. If we need to simply revert the change to ext4_write_inode(), I can do that, but I'd really like to see what's going on first. If in fact there is a code path where ext4_write_inode() can get called before ext4_load_journal(), which is what sets EXT4_SB(sb)->s_journal, then there are opportunities for other mischief/data corruption. Hmm... with this patch I clearly do not have fs corruption, but I don't see "Filesystem is read-only" in dmesg o_O. Currently I can't capture a dmesg with the oops; there is no COM connector and I failed to set up netconsole (I did it before and it worked). Probably "msleep(1)" instead of this patch would have the same result *joke*, not tested. Created attachment 23480 [details]
kernel config
How about my kernel config? I use almost the same config on all my PCs across different platforms (i386/amd64). CPUs: Pentium M, Atom, Core 2 Duo E8400; RAM: 1.5G, 2G, 4G; chipsets: i915/ICH6, G945/ICH7, G45/ICH10R. ...
So I don't think it is about hardware; it is probably about the kernel config.
Alexey
Alexey, what version of GCC are you using? If applying the patch in attachment #23468 [details] stops file system corruption, but you're not seeing the message or the kernel stack dump that would be triggered by WARN_ON(1) in your logs, then the conditional is never triggering. So the only thing that might be making a difference is if this is either moving some code around or some kind of compiler bug. Something really doesn't make sense, and there is definitely something very subtle going on here.
I'd appreciate by the way getting reports from other people who have been suffering file system corruption problems whether or not the patch in attachment #23468 [details] helps, and whether they are seeing any "File system is read-only" followed by a kernel stack trace in their system logs. Thanks!!
I use "gcc (Ubuntu 4.4.1-4ubuntu8) 4.4.1", you can see it in dmesg. Are there any way to disasm this function? Created attachment 23486 [details]
disasm with patch
This is the disassembly of ext4_write_inode with the patch applied
Created attachment 23487 [details]
disasm without patch
disasm without patch.
(In reply to comment #87) > (In reply to comment #84) > > Maybe this is a totally stupid suggestion, but is there any chance for this > code path to be executed during journal replay, when s_journal is still NULL > ? If I'm understanding Alexey's test correctly, he's not even doing journal replay; he's doing IO, crashing, and then running e2fsck -f -y immediately upon reboot without any filesystem mount (and therefore no journal replay). Alexey, is that correct? Alexey, one other question: after you did this testcase for your bisect run: mount /dev/sda13 /mnt md5deep -r /lib/modules/$k_version/* > /mnt/md5sums umount /mnt sleep 1m echo c > /proc/sysrq-trigger did you reset power to get the machine up and running again? Thanks, -Eric (In reply to comment #101) > Alexey, is that correct? No. ... crash, reboot, mount the root partition read-only, chroot into the mounted partition, and fsck the mounted partition. But it is not about what happens after the reboot: I booted a "clean" kernel after the dirty crash and got the same fs corruption. (In reply to comment #102) > did you reset power to get the machine up and running again? Yes, I reset the power; I can't use sysrq to reboot at that point. (In reply to comment #103) > (In reply to comment #101) > > Alexey, is that correct? > > No. ... crash, reboot, mount the root partition read-only, chroot into the mounted partition, and fsck the mounted partition. Oh, so you are e2fsck'ing a read-only mounted root partition... OK, I had misunderstood that part, sorry. Created attachment 23494 [details]
Alexey's x86_64 config.
well, Alexey's config seems to have done the trick (read: hosed my test box) for me. I've attached the x86_64 version he sent me. I got some serious multiply-claimed block errors after an apparently failed boot of that kernel.
hrm, or not. I thought the kernel had gotten power-cycled after booting, but since it had no serial console compiled in, I didn't see that it actually apparently didn't boot at all. Hrm. Still, I have seen problems, so I have hope yet for narrowing down a reproducer ... > I'd appreciate by the way getting reports from other people who have been
> suffering file system corruption problems whether or not the patch in
> attachment #23468 [details] helps, and whether they are seeing any "File system is
> read-only" followed by a kernel stack trace in their system logs. Thanks!!
I got some corruption, but not ro changes.
In an early rc nothing seemed wrong until, after a reboot, I saw enough
corruption in a db that I had to dump, clean up several bogus records
and restore it. (It was a busy but ephemeral db, so that wasn't too
much of a problem.)
Today, however, running -rc5, the box simply rebooted without notice,
and I got a number of log entries after the reboot. Most specified
ext4_mb_generate_buddy and jbd complained that it spotted a dirty
metadata buffer. There were also a couple of bad header/extents.
I shut down enough to unmount /var and ran e2fsck. That generated a
*slew* of errors, mostly complaining about multiple files claiming
the same blocks. A second e2fsck right after (with -f) showed no
further errors. I now have about 70 megs of data in lost+found.
The box is only partly up right now (I'm writing this over ssh);
I'll be able to provide more details after a reboot....
James, >I got some corruption, but not ro changes. When you say corruption, you mean "file data corruption", right? You said it was a database; was it a database that uses fsync() --- or put another way, was the database (could you tell us what database it was) one that claims to have ACID properties? >Today, however, running -rc5 the box simply rebooted without notice. and I >got >a number of log entries after the reboot... I stut down enough to >unmount /var and ran e2fsck. That generated a *slew* of errors, mostly >complaining about multiple files claiming the same blocks. A second e2fsck >right after (with -f) showed no further errors. I now have about 70 megs >of data in lost+found. Hmm, this sounds like the patch didn't actually help. And am I right that you never saw the "filesystem is readonly" plus a kernel stack dump in your system logs or in dmesg? The other thing which is interesting is that this happened on a non-root filesystem (/var), which means that journal wasn't replayed when the root filesystem was mounted read-only, but the journal was replayed by e2fsck. Another question --- did you have your file system configured with "tune2fs -c 1", was described in comment #59? One worry I have is that in fact the file system may have been corrupted earlier, and it simply wasn't noticed right away. In the case of fsck complaining about blocks claimed by multiple inodes, there are two causes of this. One is that one or more blocks in the inode table get written to the wrong place, overwriting another block(s) in the inode table. In that case, the pattern of corruption tends to be that since inode N is written on top of inode N+m*16 or N+m*32 (depending on whether you are using 128-byte or 256-byte sized inodes) and inode N+1 is written on top of inode (N+1)+(m+16) or (N+1)+(m*32). it's quite easy to see this pattern from the e2fsck transcript. The second case is one where the block allocation bitmap gets corrupted, such that some blocks which are in use are marked as free, and *then* the file system is remounted and files are written to the file system, such that the blocks are reallocated for new files. In that case, the pattern of the multiply-claimed blocks is random, and it's likely that you will see one or more inodes where the inode is sharing blocks with more than one inodes, and where there is no relationship between the inode numbers of inodes that are using a particular block. So far, the fsck transcripts with pass1b that people have submitted to me tend to be of the second form --- which is why I recommend the use of "tune2fs -c 1"; if the file system corruptions causing data loss are caused by corrupted block allocation blocks, then checking the filesystems after every single boot means that you might see pass 5 failures, but not the pass1b failures that are associated with data loss. Obviously, we don't want to run with "tune2fs -c 1" indefinitely, since that obviously slows down boot times, but for people who are interested in helping us debug this issue, it should allow them to avoid data loss and also help us identify *when* the file system had gotten corrupted (i.e., during the previous boot session), and hopefully allow us to find the root cause to this problem. I got some bad news for you. I have just finished recovering from another semi-severe corruption, this time again on my laptop. This time: 1 - there were no crashes, for last week, and no s2ram/disk was ever attempted. 
2 - tune2fs -c 1 was used, and I know for sure that fsck ran on every boot (tested both without usplash and with it, which gave the expected very slow boot). Yet today I did a usual 'apt-get' update, updated the -git version of NetworkManager I use, and noticed that it wouldn't connect via pptp. So, to be sure, I rebooted the system; fsck failed, and there is large-scale corruption again... So it looks like the bitmaps are being corrupted during runtime, and crashes aren't the cause. The rest is in the attached log. Created attachment 23514 [details]
fsck log
This is the new log
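As a rough illustration of Ted's earlier point about telling the two pass-1B patterns apart, here is a hypothetical little checker (not part of e2fsck, just a sketch): it tests whether two colliding inode numbers differ by a whole multiple of the number of inodes that fit in one inode-table block. The 4096-byte block size, the 256-byte inode size, and the two inode numbers (taken from the fsck excerpt earlier in this thread) are assumptions made for the sake of the example.

/* Hypothetical helper, not part of e2fsck: guess whether a pair of inodes
 * reported by pass 1B as sharing blocks looks like a misdirected inode-table
 * write (inode numbers differ by a multiple of the inodes per block) or like
 * reallocation after block-bitmap corruption (no such relationship).
 * Block size, inode size and the sample inode numbers are made-up inputs. */
#include <stdio.h>

int main(void)
{
    unsigned int block_size = 4096;               /* assumed fs block size */
    unsigned int inode_size = 256;                /* assumed on-disk inode size */
    unsigned int per_block  = block_size / inode_size;   /* 16 inodes per itable block here */

    unsigned long ino_a = 131134, ino_b = 131201; /* pair from the pass-1B excerpt above */
    unsigned long diff  = ino_a > ino_b ? ino_a - ino_b : ino_b - ino_a;

    if (diff != 0 && diff % per_block == 0)
        printf("difference %lu is a multiple of %u: consistent with a "
               "misdirected inode-table write\n", diff, per_block);
    else
        printf("difference %lu is not a multiple of %u: more like random "
               "block reuse after bitmap corruption\n", diff, per_block);
    return 0;
}

With these sample numbers the difference is 67, not a multiple of 16, which matches Ted's observation that the transcripts submitted so far look like bitmap corruption rather than inode-table overwrites.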
I was able to reproduce fs corruption by passing barrier=0, _but_ i can't be sure burriers are disabled for my dm. fstab say, disabled: ~$ grep barr /etc/fstab ----------------------------------------------------------------- /dev/mapper/zwerg_buntu-root_one / ext4 errors=remount-ro,barrier=0 0 1 /dev/sda8 /home ext4 defaults,barrier=0 0 2 ------------------------------------------------------------------- proc say, disabled: $ grep barr /proc/mounts ---------------------------------------------------------- /dev/mapper/zwerg_buntu-root_one / ext4 rw,relatime,errors=remount-ro,barrier=0,data=ordered 0 0 /dev/sda8 /home ext4 rw,relatime,barrier=0,data=ordered 0 0 ------------------------------------------------------------- dmesg say, _enabled_: kernel 2.6.31-14-generic ~$ dmesg | grep barr [ 7.063589] EXT4-fs (dm-0): barriers enabled [ 9.365835] EXT4-fs (sda8): barriers disabled kernel 2.6.32-rc5-00011-g2fdc246 [ 9.365835] EXT4-fs (sda8): barriers disabled i know 2.6.32-rc do not printk if barriers are enabled, so i assume they are still enabled. I think this mess with dmesg vs proc should be fixed too. apt-get update && apt-get upgrade && reboot is really good way to reproduce fs corruption. After this you need reinstall your system. Are here only debian and ubuntu? any other distributions? Reply-To: pavel@ucw.cz Hi! >>> So I have been experimenting with various root file systems on my >>> laptop running latest git. This laptop some times has problems waking >>> up from sleep and that results in it needing a hard reset and >>> subsequently unclean file system. >>> >> A number of people have reported this, and there is some discussion >> and some suggestions that I've made here: >> >> http://bugzilla.kernel.org/show_bug.cgi?id=14354 >> >> It's been very frustrating because I have not been able to replicate >> it myself; I've been very much looking for someone who is (a) willing >> to work with me on this, and perhaps willing to risk running fsck >> frequently, perhaps after every single unclean shutdown, and (b) who >> can reliably reproduce this problem. On my system, which is a T400 >> running 9.04 with the latest git kernels, I've not been able to >> reproduce it, despite many efforts to try to reproduce it. (i.e., >> suspend the machine and then pull the battery and power; pulling the >> battery and power, "echo c> /proc/sysrq-trigger", etc., while >> doing "make -j4" when the system is being uncleanly shutdown) >> > > I wonder if we might have better luck if we tested using an external > (e-sata or USB connected) S-ATA drive. > > Instead of pulling the drive's data connection, most of these have an > external power source that could be turned off so the drive firmware > won't have a chance to flush the volatile write cache. Note that some > drives automatically write back the cache if they have power and see a > bus disconnect, so hot unplugging just the e-sata or usb cable does not > do the trick. > > Given the number of cheap external drives, this should be easy to test > at home.... Do they support barriers? (Anyway, you may want to use some kind of VM for testing. That should make the testing cycle shorter, easier to reprorduce *and* more repeatable.) Pavel Reply-To: rwheeler@redhat.com On 10/25/2009 02:22 AM, Pavel Machek wrote: > Hi! > >>>> So I have been experimenting with various root file systems on my >>>> laptop running latest git. 
This laptop some times has problems waking >>>> up from sleep and that results in it needing a hard reset and >>>> subsequently unclean file system. >>>> >>> A number of people have reported this, and there is some discussion >>> and some suggestions that I've made here: >>> >>> http://bugzilla.kernel.org/show_bug.cgi?id=14354 >>> >>> It's been very frustrating because I have not been able to replicate >>> it myself; I've been very much looking for someone who is (a) willing >>> to work with me on this, and perhaps willing to risk running fsck >>> frequently, perhaps after every single unclean shutdown, and (b) who >>> can reliably reproduce this problem. On my system, which is a T400 >>> running 9.04 with the latest git kernels, I've not been able to >>> reproduce it, despite many efforts to try to reproduce it. (i.e., >>> suspend the machine and then pull the battery and power; pulling the >>> battery and power, "echo c> /proc/sysrq-trigger", etc., while >>> doing "make -j4" when the system is being uncleanly shutdown) >>> >> >> I wonder if we might have better luck if we tested using an external >> (e-sata or USB connected) S-ATA drive. >> >> Instead of pulling the drive's data connection, most of these have an >> external power source that could be turned off so the drive firmware >> won't have a chance to flush the volatile write cache. Note that some >> drives automatically write back the cache if they have power and see a >> bus disconnect, so hot unplugging just the e-sata or usb cable does not >> do the trick. >> >> Given the number of cheap external drives, this should be easy to test >> at home.... > > Do they support barriers? > > (Anyway, you may want to use some kind of VM for testing. That should > make the testing cycle shorter, easier to reprorduce *and* more repeatable.) > > Pavel > The drives themselves will support barriers - they are the same S-ATA/ATA drives you get normally for your desktop, etc. I think that e-SATA would have no trouble (but fewer boxes have that external S-ATA port). Not sure how reliable the SCSI -> USB -> ATA conversion is for USB drives though (a lot of moving pieces there!). VM testing is a good idea, but I worry that the virtual IO stack support for data integrity is still somewhat shaky. Christoph was working on fixing various bits and pieces I think... ric Just a note: I'm seeing this too on my new Dell laptop, with a fresh Fedora-12 Beta install (+ "yum update" + current -git kernel). The root filesystem was seriously scrogged, and I had the RPM databases corrupted. Don't know what else is wrong, I'm rebuilding them now. I had a few unclean shutdowns (debugging wireless driver etc), but they weren't during any heavy disk activity. And I'm pretty sure they weren't during any yum update, so the rpm database corruption smells like ext4 is not writing back inode information in a timely manner at all (ie any rpm database activity would have happened much earlier). Example kernel messages appended. I probably missed a few. I do not have fsck logs (this was all on the root filesystem, nowhere to log them), but they weren't pretty. I can do pretty much anything to this laptop (I'm just testing it), so I'm open to testing. 
Linus --- Oct 25 10:46:00 localhost kernel: device-mapper: multipath: version 1.1.0 loaded Oct 25 10:46:00 localhost kernel: EXT4-fs (sda1): mounted filesystem with ordered data mode Oct 25 10:46:00 localhost kernel: scsi 4:0:0:0: Direct-Access Generic- Multi-Card 1.00 PQ: 0 ANSI: 0 CCS Oct 25 10:46:00 localhost kernel: sd 4:0:0:0: Attached scsi generic sg1 type 0 Oct 25 10:46:00 localhost kernel: sd 4:0:0:0: [sdb] Attached SCSI removable disk Oct 25 10:46:00 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 2: 5553 blocks in bitmap, 5546 in gd Oct 25 10:46:00 localhost kernel: JBD: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash. ... Oct 25 10:46:56 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 133089 Oct 25 10:46:56 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 133089 Oct 25 10:46:56 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 133089 Oct 25 10:46:56 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 133089 Oct 25 10:46:56 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 133089 Oct 25 10:46:56 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 133089 Oct 25 10:46:56 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 133089 ... Oct 25 10:46:59 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 17: 3663 blocks in bitmap, 3664 in gd ... Oct 26 07:54:58 localhost kernel: EXT4-fs (sda1): mounted filesystem with ordered data mode Oct 26 07:54:58 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 3: 4789 blocks in bitmap, 3785 in gd Oct 26 07:54:58 localhost kernel: JBD: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash. Oct 26 07:54:58 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 5832 Oct 26 07:54:58 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21482 Oct 26 07:54:58 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21485 Oct 26 07:54:58 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21486 ... Oct 26 07:55:22 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 657981 Oct 26 07:55:22 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 657981 ... 
Oct 26 08:11:39 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21486 Oct 26 08:11:40 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21482 Oct 26 08:11:41 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21485 Oct 26 08:12:11 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 6: 6342 blocks in bitmap, 6330 in gd Oct 26 08:12:11 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 16: 17220 blocks in bitmap, 18932 in gd Oct 26 08:12:11 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 26: 6769 blocks in bitmap, 4721 in gd Oct 26 08:12:11 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 27: 15419 blocks in bitmap, 1081 in gd Oct 26 08:12:11 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 28: 8309 blocks in bitmap, 1879 in gd Oct 26 08:12:11 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 29: 5982 blocks in bitmap, 1880 in gd Oct 26 08:12:11 localhost kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 30: 9476 blocks in bitmap, 3886 in gd Oct 26 08:12:12 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21482 Oct 26 08:12:12 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21485 Oct 26 08:12:12 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21486 Oct 26 08:12:14 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21482 Oct 26 08:12:14 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21485 Oct 26 08:12:14 localhost kernel: EXT4-fs error (device dm-0): ext4_lookup: deleted inode referenced: 21486 Linus, are you seeing the fsck & runtime errors after a clean shutdown, or only after a crash/reset/poweroff ? Or both? (we seem to potentially have reports of both...) Thanks, -Eric It seems to be only after an unclean shutdown for me, judging by some judicious grepping of /var/log/messgaes. At least the ext4 error messages in dmesg only seem to happen when I don't see the line "Kernel logging (proc) stopped." in the messages file before the new boot. Of course, maybe the problem is caused by - or even just made visible by - not the unclean shutdown itself, but the fsck or the journal replay which is just triggered by the unclean shutdown. IOW, it could be secondary damage. That sounds unlikely, though. I dunno. As mentioned, I don't think that I was doing anything particularly write-heavy when I forced the unclean shutdown. (In reply to comment #119) > It seems to be only after an unclean shutdown for me, judging by some > judicious > grepping of /var/log/messgaes. At least the ext4 error messages in dmesg only > seem to happen when I don't see the line "Kernel logging (proc) stopped." in > the messages file before the new boot. > > Of course, maybe the problem is caused by - or even just made visible by - > not > the unclean shutdown itself, but the fsck or the journal replay which is just > triggered by the unclean shutdown. IOW, it could be secondary damage. That > sounds unlikely, though. I thought that it could be the cause, but in one case of an unclean shutdown, I rebooted into 2.6.31.1, hoping I could avoid the damage, but the fs ended up corrupted anyway. 
However, I didn't run fsck on each boot, so I cannot be sure the fs has been clean before. (In reply to comment #89) > Created an attachment (id=23458) [details] > skip the non-journal branch if readonly I'm skeptical of reports that this patch fixed things as well; from code inspection and instrumented testing, I don't see how we get here w/ s_journal NULL on a journalling filesystem mount, even during recovery... Can you test with the below patch Commit 487caeef9fc08c0565e082c40a8aaf58dad92bbb fixed a deadlock between get_block and truncate by dropping the i_data_sem during truncate. But i guess we should make sure we drop any preallocation that we did as a part of the block allocation. Otherwise we definitely can get multiple claimed blocks. But this can happen only if we have block allocation and truncate happening on the same file simultaneously. With delayed allocation i guess it is possible that pdflush tries to do block allocation while application does a truncate diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 10539e3..14539b5 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -113,6 +113,7 @@ static int ext4_ext_truncate_extend_restart(handle_t *handle, */ ext4_ext_invalidate_cache(inode); + ext4_discard_preallocations(inode); return err; } Can you try this patch ? commit a8836b1d6f92273e001012c7705ae8f4c3d5fb65 Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Date: Tue Oct 27 15:36:38 2009 +0530 ext4: discard preallocation during truncate We need to make sure when we drop and reacquire the inode's i_data_sem we discard the inode preallocation. Otherwise we could have blocks marked as free in bitmap but still belonging to prealloc space. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 5c5bc5d..a1ef1c3 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -209,6 +209,12 @@ static int try_to_extend_transaction(handle_t *handle, struct inode *inode) up_write(&EXT4_I(inode)->i_data_sem); ret = ext4_journal_restart(handle, blocks_for_truncate(inode)); down_write(&EXT4_I(inode)->i_data_sem); + /* + * We have dropped i_data_sem. So somebody else could have done + * block allocation. So discard the prealloc space created as a + * part of block allocation + */ + ext4_discard_preallocations(inode); return ret; } I tested this two patches together, after first crush i get fs corruption: "Block bitmap differences" and "blocks count wrong". The strange thing is - the journal patch still working for me. I didn't get any evidence it hitting "w/ s_journal NULL", so i will continue killing my system with this patch, will see if i get any fs corruptions. You need to test only with patch in Comment #123 Also can you make sure you start with a clean file system before the test. ie fsck and make sure file system is clean. Then boot the new kernel and run the test ? -aneesh *** Bug 14422 has been marked as a duplicate of this bug. *** (In reply to comment #125) my system set up to check fs on every boot. I confirm, it was clean. Now i testing patch in Comment #123, first crash was clean.. i'll do more crashes... :) So i got corruption with this patch too: "i_blocks... i_size should be", "multiply claimed blocks" ... I got corruption o third crush. Created attachment 23550 [details]
fsck with journal corruption
This attachment is about my previous test i use to get fs corruption (make install(3 times) && get oops && restart) with only one difference, to run fsck i use "-o ro,noload" to mount root. before i use "-o ro". Now i can see journal corruption wich was hidden by kernel before. Probably this is the cause of this issue. Some more testing of this issue: - if I do something like [root@inspiron-laptop linux]# sync [root@inspiron-laptop linux]# echo b > /proc/sysrq-trigger I never seem to see any problems. I've done the above several times - I started using it as my shutdown replacement, and I'm bisecting another independent issue on the same laptop. And no problems so far. - but when I did the whole power-button thing even when the system looked otherwise idle (it was hung at shutdown, and I waited ten+ seconds), I get in trouble very easily. This makes me suspect that the background flush by bdflush simply isn't working well for ext4. Even when you'd expect things to be idle, and bdflush should hopefully get your metadata out, it just doesn't. I'd suspect the 'wait=0' case in ext4_write_inode. It does the same 'don't do anything' that ext3 does, but presumably the delayed allocation makes for much worse trouble. Notice how just returning zero from ext4_write_inode() will make the general fs/fs-writeback.c just remove the inode from the dirty lists entirely. One thing to note is that the ext4_write_inode() behavior certainly isn't new, but it's entirely possible that the per-bdi writeback logic has changed timing of the writebacks a lot. Which may explain why it's apparently getting more attention in the 2.6.32-rc series, although some people seem to have seen it earlier too. A few other observations from testing today. I'm driving a test on a lab box by doing a make modules_install into /tmp/modules on the victim box, and cycling power at some random point while that runs. 2.6.32 seems to hit this reliably, with the root fs pretty well munged. 2.6.31 hasn't seen it hit yet in my testing, maybe Linus' ext4_write_inode vs. writeback theory is correct ... Perhaps more strange, doing the same test on a non-root fs under 2.6.32 also doesn't seem to hit it reliably. Could it be something about the remount,ro flush of the root fs on the way down? Suspecting that possibly "mount -o ro; e2fsck -a /dev/root" during bootup was causing problems by writing to the mounted fs, I short-circuited the boot-time fsck -a; things were still badly corrupted so that doesn't seem to be it. (In reply to comment #132) > Perhaps more strange, doing the same test on a non-root fs under 2.6.32 also > doesn't seem to hit it reliably. Could it be something about the remount,ro > flush of the root fs on the way down? er, no, that doesn't make sense in my poweroff scenario does it! Sigh. On Tue, 27 Oct 2009, in comment #132 from Eric Sandeen <sandeen@redhat.com> wrote: > > Perhaps more strange, doing the same test on a non-root fs under 2.6.32 also > doesn't seem to hit it reliably. Could it be something about the remount,ro > flush of the root fs on the way down? > > Suspecting that possibly "mount -o ro; e2fsck -a /dev/root" during bootup was > causing problems by writing to the mounted fs, I short-circuited the > boot-time > fsck -a; things were still badly corrupted so that doesn't seem to be it. It certainly isn't about the 'remount,ro' on the way down, since that's the part you avoid entirely in a non-clean shutdown. 
But it could easily be something special about mounting the root filesystem, together with bad interaction with 'fsck'. Non-root filesystems will be fsck'd _before_ being mounted, but the root filesystem will be fsck'd _after_ the mount. If the initial root ro-mount causes the filesystem recovery, and fsck then screws things up, then "root filesystem is special" might well trigger. It might explain why Ted and others are unable to re-create this - maybe they are being careful, and do ext4 testing with a non-ext4 root? Example issues that are exclusive to the root filesystem and would never be an issue on any other filesystem (exactly due to the "mount ro first, fsck later" behavior of root): - flaky in-kernel recovery code might trash more than it fixes, and would never trigger for the "fsck first" case because fsck would already have done it. - flaky user-mode fsck doesn't understand that the kernel already did recovery, and re-does it. - virtually indexed caches might be loaded by the mount, and when you do fsck later, the fsck writes back through the physically indexed direct device. So the mounted root filesystem may never see those changes, even after you re-mount it 'rw'. - even if every filesystem cache is physically indexed (ie using the buffer cache rather page cache), there may be various cached values that the kernel keeps around in separate caches, like the superblock compatibility bits, free block counts, etc. fsck might change them, but does 'remount,rw' always re-read them? None of the four cases above are issues for a filesystem that isn't mounted before fsck runs. Linus (In reply to comment #134) > Example issues that are exclusive to the root filesystem and would never > be an issue on any other filesystem (exactly due to the "mount ro first, > fsck later" behavior of root): > > - flaky in-kernel recovery code might trash more than it fixes, and would > never trigger for the "fsck first" case because fsck would already have > done it. Yep, but I didn't have auto-fsck in my non-root testing, and I did "mount -o ro" followed by "e2fsck -a" on it to try to "be like /", and didn't see it. Though I'll re-check just to be sure. > - flaky user-mode fsck doesn't understand that the kernel already did > recovery, and re-does it. It should print out a message if it's recovering again, we should see it in the logs... and I don't think I have. > - virtually indexed caches might be loaded by the mount, and when you do > fsck later, the fsck writes back through the physically indexed direct > device. So the mounted root filesystem may never see those changes, > even after you re-mount it 'rw'. Yes, I've been worried about this... but would this be new? Could probably test this theory by forcing an immediate reboot if e2fsck does recovery on a readonly mounted fs. > - even if every filesystem cache is physically indexed (ie using the > buffer cache rather page cache), there may be various cached values > that the kernel keeps around in separate caches, like the superblock > compatibility bits, free block counts, etc. fsck might change them, but > does 'remount,rw' always re-read them? same test above would help determine this, I think. One thing to note that my notebook has separate '/' and '/home' filesystems. Both are ext4, and I usually see corruption only on the '/' However, much less often, but still I do see similar corruption within /home. Alexey Fisher wrote:
> This attachment is about my previous test i use to get fs corruption (make
> install(3 times) && get oops && restart) with only one difference, to run
> fsck
> I use "-o ro,noload" to mount root. before i use "-o ro". Now i can see
> journal
> corruption wich was hidden by kernel before.
This is not a valid test. Mounting with "-o noload" will discard all of the
transaction information in the journal, and virtually guarantee the filesystem
is inconsistent. It will be no better than ext2, which requires a full e2fsck
run after each crash. This is NOT a valid reproducer for this problem.
(In reply to comment #137) > This is not a valid test. Mounting with "-o noload" will discard all of the > transaction information in the journal, and virtually guarantee the > filesystem > is inconsistent. It will be no better than ext2, which requires a full > e2fsck > run after each crash. This is NOT a valid reproducer for this problem. Grr... i know my english is not good, but is it really hard to understand why i mean? "noload" is not used to reproduce crash! To crash i use default options of distribution. In this bug i do not trust distribution to run fsck, so i do it manually. I start in to initrd, so root is not mounted! Then i mount it manually to be sure it is readonly. Normally i mounted on this stage with option "-o ro". The result of thees was - we newer so "jurnal corruption", because the kernel silently "repaired" it. Now i use "-o ro,noload" to mount root and run fsck (not to reproduce crush). And now i can see, journal is not corrupted after normal crush. If yournall is corrupt all fs is corrupt too. Now is the question: do we use journal to recover fs? if we use broken journal how this recovery will look like? Do we get this "multiply claimed blocks" because we get wrong information from journal and this is the reason, why some times files which was written long time before are corrupt too? I think he meant that it clears the journal when you do that. If that's the case, then you can't confirm something went wrong using your method because for all you know the journal was correct and would have repaired everything, but now it's been cleared and fsck won't have the chance to replay it. But yeah, if noload just leaves the journal untouched so fsck can later have a look at it, then what you tried makes sense. What is the best way to reproduce this bug? I have set up a VM, and did reproduce it there, however, it reproduces very rarely.... :-( Maxim, I'm reproducing it like this: #!/bin/bash echo "clearing & reinstalling modules" mkdir -p /tmp/modules rm -rf /tmp/modules/* mount /dev/sdb4 /mnt/test2/ cd /mnt/test2/linux-2.6 while true; do make -j2 MODLIB=/tmp/modules modules_install done where /mnt/test2/linux-2.6 is just a recent kernel tree. I then cut power to the box in the middle of the install, reboot, and see what shape I'm in. I had been cutting power after a random amount of time but I think it's probably best to catch it while the modules are being actively copied over, rather than say while depmod is running, but I'm not 100% certain about that. -Eric Eric Sandeen, Thanks! I tried similar approaches, I copied lots of files, I created/removed random files, I tried to use apt-get upgrade, many things. Sometimes I see corruption, but very rarely.... Since I feel more and more frustrated with this VM, and feel that this is a dead end, I removed it, and now I think about some way to reproduce on real system without mass corruption. On Tue, 2009-10-27 at 15:45 +0530, Aneesh Kumar K.V wrote: > Can you try this patch ? > > commit a8836b1d6f92273e001012c7705ae8f4c3d5fb65 > Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> > Date: Tue Oct 27 15:36:38 2009 +0530 > > ext4: discard preallocation during truncate > > We need to make sure when we drop and reacquire the inode's > i_data_sem we discard the inode preallocation. Otherwise we > could have blocks marked as free in bitmap but still belonging > to prealloc space. 
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> > Make sense, reviewed-by: Mingming Cao <cmm@us.ibm.com> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 5c5bc5d..a1ef1c3 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -209,6 +209,12 @@ static int try_to_extend_transaction(handle_t *handle, > struct inode *inode) > up_write(&EXT4_I(inode)->i_data_sem); > ret = ext4_journal_restart(handle, blocks_for_truncate(inode)); > down_write(&EXT4_I(inode)->i_data_sem); > + /* > + * We have dropped i_data_sem. So somebody else could have done > + * block allocation. So discard the prealloc space created as a > + * part of block allocation > + */ > + ext4_discard_preallocations(inode); > > return ret; > } > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Reply-To: parag.lkml@gmail.com On Tue, Oct 27, 2009 at 6:15 AM, Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> wrote: > Can you try this patch ? > > commit a8836b1d6f92273e001012c7705ae8f4c3d5fb65 > Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> > Date: Â Tue Oct 27 15:36:38 2009 +0530 > > Â Â ext4: discard preallocation during truncate > > Â Â We need to make sure when we drop and reacquire the inode's > Â Â i_data_sem we discard the inode preallocation. Otherwise we > Â Â could have blocks marked as free in bitmap but still belonging > Â Â to prealloc space. Just wanted to let you know that I have applied this patch and one unclean shutdown later it seems to have not given me any trouble. I will continue testing it - hopefully I won't have to reformat this time ( every time I tested previously I ended up having weird issues that I decided to get rid of by reformatting /). Parag Parag Warudkar wrote: > On Tue, Oct 27, 2009 at 6:15 AM, Aneesh Kumar K.V > <aneesh.kumar@linux.vnet.ibm.com> wrote: >> Can you try this patch ? >> >> commit a8836b1d6f92273e001012c7705ae8f4c3d5fb65 >> Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> >> Date: Tue Oct 27 15:36:38 2009 +0530 >> >> ext4: discard preallocation during truncate >> >> We need to make sure when we drop and reacquire the inode's >> i_data_sem we discard the inode preallocation. Otherwise we >> could have blocks marked as free in bitmap but still belonging >> to prealloc space. > > Just wanted to let you know that I have applied this patch and one > unclean shutdown later it seems to have not given me any trouble. > > I will continue testing it - hopefully I won't have to reformat this > time ( every time I tested previously I ended up having weird issues > that I decided to get rid of by reformatting /). I've been running my testcase, and I just hit the usual corruption with this patch in place after 8 iterations, I'm afraid. -Eric > Parag > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html On Thu, Oct 29, 2009 at 05:25:17PM -0400, Parag Warudkar wrote:
> > Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> > Date: Tue Oct 27 15:36:38 2009 +0530
> >
> >     ext4: discard preallocation during truncate
> >
> Just wanted to let you know that I have applied this patch and one
> unclean shutdown later it seems to have not given me any trouble.
>
> I will continue testing it - hopefully I won't have to reformat this
> time ( every time I tested previously I ended up having weird issues
> that I decided to get rid of by reformatting /).
Cool!
It looked like Avery Fisher had reported that the patch he tried
didn't help him, and so I had put it in the category of "good patch,
makes sense, but probably not the one that would solve this
regression". But if it works for you, I'll accelerate getting it to
Linus. Do let us know if you have any additional problems with this
patch, please!
- Ted
Reply-To: parag.lkml@gmail.com

On Thu, Oct 29, 2009 at 5:42 PM, Theodore Tso <tytso@mit.edu> wrote:
> On Thu, Oct 29, 2009 at 05:25:17PM -0400, Parag Warudkar wrote:
>> > Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>> > Date: Tue Oct 27 15:36:38 2009 +0530
>> >
>> >     ext4: discard preallocation during truncate
>> >
>> Just wanted to let you know that I have applied this patch and one
>> unclean shutdown later it seems to have not given me any trouble.
>>
>> I will continue testing it - hopefully I won't have to reformat this
>> time (every time I tested previously I ended up having weird issues
>> that I decided to get rid of by reformatting /).
>
> Cool!
>
> It looked like Avery Fisher had reported that the patch he tried
> didn't help him, and so I had put it in the category of "good patch,
> makes sense, but probably not the one that would solve this
> regression". But if it works for you, I'll accelerate getting it to
> Linus. Do let us know if you have any additional problems with this
> patch, please!
>

I have gone through 3 unclean shutdowns so far - this time on Ubuntu Karmic Release, today's -git and this patch. Nothing odd to report. [ Goes and crashes it again - comes back clean! ]

Since literally all of my previous attempts resulted in problems, something seems to have improved, although there is no way to tell whether it will last, or whether it was Aneesh's patch alone that improved things, since the last time I tried this I ended up reinstalling, I switched to Karmic from Jaunty, and I did not test mainline for a few days. At any rate, I will let everyone know if I see anything different.

Thanks

Parag

>> - virtually indexed caches might be loaded by the mount, and when you do
>> fsck later, the fsck writes back through the physically indexed direct
>> device. So the mounted root filesystem may never see those changes,
>> even after you re-mount it 'rw'.
>
> Yes, I've been worried about this... but would this be new?

Eric, Linus,

If fsck ever modifies the filesystem, it sets bit 0 in the exit status code. For the root file system, bit 1 is set as well. To quote the fsck man page:

    The exit code returned by fsck is the sum of the following conditions:
         0    - No errors
         1    - File system errors corrected
         2    - System should be rebooted
         4    - File system errors left uncorrected
         8    - Operational error
         16   - Usage or syntax error
         32   - Fsck canceled by user request
         128  - Shared library error
    The exit code returned when multiple file systems are checked is the
    bit-wise OR of the exit codes for each file system that is checked.

Distribution init scripts are supposed to force a reboot when the root file system is checked and fsck returns a non-zero exit code. This should avoid the problem that Linus is afraid of.

Of course, Ubuntu Karmic *does* have a new set of fancy-schmancy init scripts that is supposed to run (some) fsck's in parallel with the rest of the boot process. Presumably this doesn't include critical file systems such as the root and whatever filesystems might be supplying /var or /usr. But maybe Karmic introduced a bug and the init scripts aren't forcing a reboot after fsck prints "*** REBOOT LINUX ***" and returns with an exit status code of 3 (file system errors corrected, system should be rebooted).

Note though that this shouldn't happen on a simple unclean shutdown, since the journal replay takes place before the root file system is initially mounted (even read-only). [n.b.
The fact that Avery was using -o noload is not normal, and I'm not sure what he was trying to test.]

Avery,

> In this bug I do not trust the distribution to run fsck, so I do it manually.
> I boot into the initrd, so the root fs is not mounted! Then I mount it manually
> to be sure it is read-only. Normally I mounted it at this stage with "-o ro".
> The result was that we never saw "journal corruption", because the kernel
> silently "repaired" it.

I think there's some confusion here. If there is journal corruption, you would see a complaint in the dmesg logs after the root filesystem is mounted. One of the changes which did happen as part of 2.6.32-rc1 is that journal checksums are enabled by default. This is a *good* thing, since if there are journal corruptions, we find out about them. Note that in this case, journal corruption usually means an I/O error has occurred, causing the checksum to be incorrect.

Now, one of the problems is that we don't necessarily have a good recovery path if the journal checksums don't check out, and the journal replay is aborted. In some cases, finishing the journal replay might actually cause you to be better off, since it might be that the corrupted block occurs later in the journal in a correct fashion, and a full journal replay might cause you to be lucky. In contrast, aborting the journal might end up with more corruptions for fsck to fix.

> Now I use "-o ro,noload" to mount root and then run fsck (not to reproduce the
> crash). And now I can see that the journal is not corrupted after a normal
> crash. If the journal is corrupt, the whole fs is corrupt too.

OK, so "mount -o ro,noload" is not safe, and in fact the file system can be inconsistent if you don't replay the journal. That's because after the transaction is committed, the filesystem starts staging writes to their final location on disk. If we crash before this is complete, the file system will be inconsistent, and we have to replay the journal in order to make the file system consistent. This is why we force a journal replay by default, even when mounting the file system read-only. It's the only way to make sure the file system is consistent enough that you can even *run* fsck. (Distributions that want to be extra paranoid should store fsck and all of its needed files in the initrd, and then check the root filesystem before mounting it ro, but no distro does this as far as I know.)

(Also, in theory, we could implement a read-only mode where we don't replay the journal, but instead read the journal, figure out all of the blocks that would be replayed, and then intercept all reads to the file system such that if the block exists in the journal, we use the most recent version of the block in the journal instead of the physical location on disk. This is far more complicated than anyone has had time/energy to write, so for now we take the cop-out route of replaying the journal even when the filesystem is mounted read-only.)

If you do mount -o ro,noload, you can use fsck to clean the filesystem, but per my previous comment in this bug report, you *must* reboot afterwards in this case, since fsck modifies the mounted root filesystem to replay the journal, and there may be cached copies of the filesystem that are modified in the course of the replay of the journal by fsck. (n.b., it's essentially the same code that is used to replay the journal, regardless of whether it is in the kernel or in e2fsck; the journal recovery.c code is kept in sync between the kernel and e2fsprogs sources.)

> Now the question is: do we use the journal to recover the fs?
> If we use a broken journal, what will that recovery look like? Do we get these
> "multiply claimed blocks" because we get wrong information from the journal,
> and is this the reason why files that were written long before the crash
> sometimes end up corrupt too?

Again, I think there is some confusion here. If the journal is "broken" (aka corrupted) you will see error messages from the kernel and e2fsck when it verifies the per-commit checksum and notices a checksum error. It won't know what block or blocks in the journal were corrupted, but it will know that one of the blocks in a particular commit must have been incorrectly written, since the commit checksum doesn't match up. In that case, it can be the cause of file system corruption --- but it's hardly the only way that file system corruption can occur. There are many other ways it could have happened as well.

It seems unlikely to me that the "multiply claimed blocks" could be caused directly by wrong information in the journal. As I said earlier, what is most likely is that the block allocation bitmaps are getting corrupted, and then the file system was mounted read/write and new files written to it. It's possible the bitmaps could have been corrupted by a journal replay, but that's only one of many possible ways that bitmap blocks could have gotten corrupted, and if the journal had been corrupted, there should have been error messages about journal checksums not being valid.

Created attachment 23592 [details]
console log w/ aneesh's patch
Well, now that I look again, w/ Aneesh's patch it didn't show up the multiply-claimed blocks this time, but did still have pretty massive corruption. Probably needs a few more test runs to make sure that's not a fluke.
Attached is a console log of the fs errors runtime, followed by the boot-time fsck.
-Eric
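To make the recovery rules above concrete: the manual procedure being discussed amounts to checking the root device from the initrd before the normal boot mounts it, and then acting on fsck's exit-status bits. The sketch below is illustrative only; ROOTDEV is a placeholder, the bit values are the ones from the fsck man page quoted above, and this is not a tested recovery script.

#!/bin/sh
# Check the root fs from the initrd, before the normal boot mounts it rw.
ROOTDEV=/dev/root        # placeholder; substitute the real root device

fsck -y "$ROOTDEV"
status=$?

# bit 0 (1) = errors corrected, bit 1 (2) = system should be rebooted.
# For the root filesystem, either one means: reboot before going on.
if [ $((status & 3)) -ne 0 ]; then
    echo "root filesystem was modified by fsck; rebooting"
    reboot -f
fi

# bit 2 (4) = errors left uncorrected; drop to a shell for manual repair.
if [ $((status & 4)) -ne 0 ]; then
    echo "uncorrected errors on $ROOTDEV; starting emergency shell"
    exec /bin/sh
fi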
On Thu, Oct 29, 2009 at 04:38:57PM -0500, Eric Sandeen wrote:
>
> I've been running my testcase, and I just hit the usual corruption with
> this patch in place after 8 iterations, I'm afraid.

Eric, since you have a relatively controllable reproduction case, have you tried reproducing Alexey's bisection results? Specifically he seemed to find that commit fe188c0e shows no problem, and commit 91ac6f43 is the first commit with problems?

Having to do multiple iterations will make doing a bisection a major pain, but maybe we'll get something out of that.

Other things that might be worth doing given that you have a test case would be to try reverting commit 91ac6f43, and see if that helps, and to try this patch: http://bugzilla.kernel.org/attachment.cgi?id=23468

Or have you tried some of these experiments already?

Regards,

- Ted

Ted, thank you for the explanation :) A note: I am studying computer forensics, and was trained to mount all evidence systems with "-o ro" so as not to contaminate them. It seems ext4 breaks this tradition, so many forensic examiners will be surprised when the md5sums do not match.

Sandeen, one more thing: I noticed on my system that after a fresh install (with the partition reformatted too), it is harder to reproduce the first corruption than on a partition where I already had corruption and fixed it with fsck. So there may be more false results. Is the fs really properly fixed after fsck?

s/Avery/Alexey/g

Hi Linus, Eric,

There are two changes that went in after 2.6.31 that I am doubtful about.

0d34ec62e18984ac9476208660372306ef54e70d
That changes ext4_file_write to not do a journal commit if the inode doesn't have the dirty flag set. Jan Kara actually found other issues with the ext4 sync code path and has posted a patch here ( MID 1256647729-29834-1-git-send-email-jack@suse.cz ) http://article.gmane.org/gmane.comp.file-systems.ext4/16126. But that should not result in multiply claimed blocks.

487caeef9fc08c0565e082c40a8aaf58dad92bbb
The change allowed i_data_sem to be released during truncate. That implies we could allow block allocation while we are doing a truncate. I did a patch to drop the prealloc space once we reacquire i_data_sem, posted in comment 123 of this bug. But that didn't fix the corruption.

So maybe we want to test backing out 487caeef9fc08c0565e082c40a8aaf58dad92bbb and applying the patch at http://article.gmane.org/gmane.comp.file-systems.ext4/16126 and see if we are able to reproduce.

-aneesh

To further update on 487caeef9fc08c0565e082c40a8aaf58dad92bbb: if we happen to create a prealloc space mapping the same offset that we are truncating, we can end up with multiply claimed blocks. ext4_mb_return_to_preallocation, which is done during truncate, doesn't put the blocks back into the prealloc space, because it depends on us discarding the prealloc space at the beginning of the truncate.

Now, whether we can end up with a block allocation that gets a prealloc space mapping the same offset that we are truncating is something I haven't yet figured out. We do invalidate all pages before we start a truncate, which implies the flusher threads will not find any of those dirty pages. Also, for multiply claimed blocks to happen we would have to get the same block number that we are freeing via truncate; AFAIU that cannot happen, because the bits mapping those blocks are marked used in the buddy cache, so we cannot preallocate them. But I guess discarding the prealloc space after reacquiring i_data_sem seemed a safe bet to me. Still, that doesn't fully explain why we end up with multiply claimed blocks.
Maybe I am missing some code paths in the above analysis.

-aneesh

Theodore Tso wrote:
> On Thu, Oct 29, 2009 at 04:38:57PM -0500, Eric Sandeen wrote:
>> I've been running my testcase, and I just hit the usual corruption with
>> this patch in place after 8 iterations, I'm afraid.
>
> Eric, since you have a relatively controllable reproduction case, have
> you tried reproducing Alexey's bisection results? Specifically he
> seemed to find that commit fe188c0e shows no problem, and commit
> 91ac6f43 is the first commit with problems?

I can try it, but I have very little faith in that result, to be honest.

> Having to do multiple iterations will make doing a bisection a major
> pain, but maybe we'll get something out of that.

Well, I've been doing bisects, but I'm getting skeptical of the results; either my testcase isn't reliable enough or all the merges are confusing git-bisect (?). Anyway, it keeps ending up on nonsensical commits.

> Other things that might be worth doing given that you have a test case
> would be to try reverting commit 91ac6f43, and see if that helps, and
> to try this patch: http://bugzilla.kernel.org/attachment.cgi?id=23468
>
> Or have you tried some of these experiments already?
>
> Regards,
>
> - Ted

After talking to Aneesh last night, I think other good spot-checks will be to revert 487caeef9fc08c0565e082c40a8aaf58dad92bbb, and to test Jan's sync patches.

-Eric

Reverting 487caeef9fc08c0565e082c40a8aaf58dad92bbb (that i_sem change from Jan) on .32-rc5 still leads to significant corruption in my testing, so it doesn't seem to be the (only?) regression. (That corruption includes many multiply-claimed blocks, FWIW.)

On Fri, 30 Oct 2009, bugzilla-daemon@bugzilla.kernel.org wrote:
>
> Well I've been doing bisects but I'm getting skeptical of the results;
> either my testcase isn't reliable enough or all the merges are confusing
> git-bisect (?) Anyway it keeps ending up on nonsensical commits.

Merges don't confuse git-bisect (lots of merges may make bisection slightly less _efficient_, though, because it can be impossible to find commits that are nicely "in the middle"), but the way bisection works, even a _single_ wrong "git bisect good/bad" statement can - and will - make bisection results go off totally in the weeds.

Bisection is designed to be efficient, which is absolutely wonderful when you have a 100% unambiguous bug that needs to be found in 10,000 commits. It's what gives us that nice log2() behavior. But the downside of that efficiency is the same as the downside of any very efficient data compression with no checksum: there is absolutely zero redundancy.

So then if even just a single bit of information (in this case the "good/bad" info) is wrong, the end result will not just be wrong - it will be _disastrously_ wrong, and you won't even be anywhere _close_ to the correct end result.

[ In fact, the way bisection works, if you give a single incorrect bisection point, the end result will usually be close to the commit you incorrectly marked, rather than the actual buggy commit you searched for. That _can_ be used as a hint when you're unsure - when getting a clearly bogus bisection result, you can look at the bisect log, and try to find the first commit in the log that is close to the clearly bogus result, and you might decide to re-test that commit more carefully.
And then if, on re-testing, it turns out that you had been wrong the first time, you might be able to avoid re-doing the whole bisection thing, and instead re-do the log up to the incorrect - and now fixed - entry, and continue from there. ]

That said, if you end up being uncertain about the bisection points, and can't give more than a "95% confidence", bisection will almost inevitably fail. A bug that is timing-dependent can be almost impossible to bisect. Even if you can reliably trigger it _eventually_, you may have to test so long at each bisection point that bisection becomes practically useless.

Linus

Thanks Linus, yep, I probably just got "lucky" and incorrectly marked a good run. Or, somehow the bisect between two post-.31 commits wound up with .30 in the Makefile, and I *cough* tested the wrong kernel, not expecting the version to move backwards. I'm running some bisects & tests in the background to try to narrow it down while we have our thinking caps on ...

FWIW I can confirm that 2.6.32-rc3 fails.

Created attachment 23601 [details]
console log of 2.6.32-rc3 corruption
runtime errors & subsequent fsck of corrupted fs w/ my testcase on -rc3
Reply-To: adilger@sun.com

On 2009-10-29, at 15:38, Eric Sandeen wrote:
> Parag Warudkar wrote:
>> On Tue, Oct 27, 2009 at 6:15 AM, Aneesh Kumar K.V
>> <aneesh.kumar@linux.vnet.ibm.com> wrote:
>>> Can you try this patch ?
>>>
>>> commit a8836b1d6f92273e001012c7705ae8f4c3d5fb65
>>> Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
>>> Date: Tue Oct 27 15:36:38 2009 +0530
>>>
>>> ext4: discard preallocation during truncate
>>>
>>> We need to make sure when we drop and reacquire the inode's
>>> i_data_sem we discard the inode preallocation. Otherwise we
>>> could have blocks marked as free in bitmap but still belonging
>>> to prealloc space.
>> Just wanted to let you know that I have applied this patch and one
>> unclean shutdown later it seems to have not given me any trouble.
>> I will continue testing it - hopefully I won't have to reformat this
>> time ( every time I tested previously I ended up having weird issues
>> that I decided to get rid of by reformatting /).
>
> I've been running my testcase, and I just hit the usual corruption
> with this patch in place after 8 iterations, I'm afraid.

I wonder if there are multiple problems involved here? Eric, it seems possible that your reproducer is exercising a similar, though unrelated, codepath.

I would agree with Ted that if this patch is solving the problem for some of the users it is definitely worth landing, even if it ends up not solving all of the problems.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

On Fri, Oct 30, 2009 at 01:56:27PM -0600, Andreas Dilger wrote:
> I wonder if there are multiple problems involved here? Eric, it seems
> possible that your reproducer is exercising a similar, though unrelated
> codepath.
Note that Aneesh has published two patches which insert a call to
ext4_discard_preallocations(). One is a patch which inserts it into
fs/inode.c's truncate path (for direct/indirect-mapped inodes) and one
which inserts it into the fs/extents.c truncate path (for
extent-mapped inodes). As near as I can tell both patches are
necessary, and it looks to me like they should be combined into a
single patch, since commit 487caeef9 affects both truncate paths.
Aneesh, do you concur?
Like Andreas, I am suspicious that there may be multiple problems
occurring here, so here is a proposed plan of attack.
Step 1) Sanity check that commit 0a80e986 shows the problem. This is
immediately after the first batch of ext4 patches which I sent to
Linus during the post-2.6.31 merge window. Given that patches in the
middle of this first batch have been reported by Avery as showing the
problem, and while we may have some "git bisect good" revisions that
were really bad, in general if a revision is reported bad, the problem
is probably there at that version and successive versions. Hence, I'm
_pretty_ sure that 0a80e986 should demonstrate the problem.
Step 2) Sanity check that commit ab86e576 does _not_ show the problem.
This commit corresponds to 2.6.31-git6, and there are no ext4 patches
that I pushed before that point. There are three commits that show up
in response to the command "git log v2.6.31..v2.6.31-git6 -- fs/ext4
fs/jbd2", but they weren't pushed by me. Although come to think of
it, Jan Kara's commit 0d34ec62, "ext4: Remove syncing logic from
ext4_file_write" is one we might want to look at very carefully if
commit ab86e576 also shows the problem....
Step 3) Assuming that Step 1 and Step 2 are as I expect, with commit
ab86e576 being "good", and commit 0a80e986 being "bad", we will have
localized the problem commit(s) to the 63 commits that were initially
pushed to Linus during the merge window. One of the commits is
487caeef9, which Aneesh has argued convincingly seems to be
problematic, and which seems to solve at least one or two reporter's
problems, but clearly isn't a complete solution. So let's try to
narrow things down further by testing this branch:
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git test-history
This branch corresponds to commit ab86e576 (from Step 2), but with the
problematic commit 487caeef9 removed. It was generated by applying
the following guilt patch series to v2.6.31-git6:
git://repo.or.cz/ext4-patch-queue.git test-history
The advantage of starting with the head of test-history is that if
there are multiple problematic commits, this should show the problem
(just as reverting 487caeef9 would) --- but since 487caeef9 is
actually removed, we can now do a "git bisect start test-history
v2.6.31-git6" and hopefully be able to localize whatever additional
commits might be bad.
(We could also keep applying and unapplying the patch corresponding to
the revert of 487caeef9 while doing a bisection, but that tends to be
error prone.)
Does that sounds like a plan?
- Ted
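For reference, the bisection described in Step 3 would look roughly like the session sketched below. The remote name "tytso" and the build/test step are placeholders; only the branch and the v2.6.31-git6 starting point come from the plan above.

# fetch the test-history branch (remote name is arbitrary)
git remote add tytso git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git
git fetch tytso test-history

# bisect between 2.6.31-git6 (good) and the head of test-history (bad)
git bisect start tytso/test-history v2.6.31-git6

# at each step: build and boot the kernel, run the power-fail test enough
# times to be reasonably confident, then record the verdict:
git bisect good    # or: git bisect bad

# when done, keep the log and restore the original checkout
git bisect log > bisect.log
git bisect reset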
On Sat, Oct 31, 2009 at 05:15:28AM -0400, Theodore Tso wrote:
> On Fri, Oct 30, 2009 at 01:56:27PM -0600, Andreas Dilger wrote:
> > I wonder if there are multiple problems involved here? Eric, it seems
> > possible that your reproducer is exercising a similar, though unrelated
> > codepath.
>
> Note that Aneesh has published two patches which insert a call to
> ext4_discard_preallocations(). One is a patch which inserts it into
> fs/inode.c's truncate path (for direct/indirect-mapped inodes) and one
> which inserts it into the fs/extents.c truncate path (for
> extent-mapped inodes). As near as I can tell both patches are
> necessary, and it looks to me like they should be combined into a
> single patch, since commit 487caeef9 affects both truncate paths.
> Aneesh, do you concur?
>
We need only the patch that drops the prealloc space in ext4_truncate_restart_trans.
ext4_ext_truncate_extend_restart calls ext4_truncate_restart_trans. So adding
the prealloc space dropping in ext4_truncate_restart_trans should handle both
direct/indirect-mapped inodes and extent-mapped inodes.
-aneesh
(In reply to comment #161)
> Step 1) Sanity check that commit 0a80e986 shows the problem. This is
> immediately after the first batch of ext4 patches which I sent to
> Linus during the post-2.6.31 merge window. Given that patches in the
> middle of this first batch have been reported by Avery as showing the
> problem, and while we may have some "git bisect good" revisions that
> were really bad, in general if a revision is reported bad, the problem
> is probably there at that version and successive versions. Hence, I'm
> _pretty_ sure that 0a80e986 should demonstrate the problem.
>
> Step 2) Sanity check that commit ab86e576 does _not_ show the problem.
> This commit corresponds to 2.6.31-git6, and there are no ext4 patches
> that I pushed before that point. There are three commits that show up
> in response to the command "git log v2.6.31..v2.6.31-git6 -- fs/ext4
> fs/jbd2", but they weren't pushed by me. Although come to think of
> it, Jan Kara's commit 0d34ec62, "ext4: Remove syncing logic from
> ext4_file_write" is one we might want to look at very carefully if
> commit ab86e576 also shows the problem....

Ok, I'm testing these. I'm really starting to think, though, that this might be a latent problem exposed by other changes; I can't otherwise explain my utter failure at trying to run git-bisect over larger ranges...

-Eric

Well, the proposed "good" point survived 20 runs. The proposed bad point survived 14 runs, and then showed a problem. Confidence level... well... not great yet. I'll test the good point for more runs, to be sure.

Oh... yeah... now I see how bad my bisection was.

Well, this time I went for many, many more runs on each bisect point and came up with:

# good: [ab86e5765d41a5eb4239a1c04d613db87bea5ed8] Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6
git bisect good ab86e5765d41a5eb4239a1c04d613db87bea5ed8
# bad: [0a80e9867db154966b2a771042e10452ac110e1e] ext4: replace MAX_DEFRAG_SIZE with EXT_MAX_BLOCK
git bisect bad 0a80e9867db154966b2a771042e10452ac110e1e
# good: [98a56ab382079f777e261e14512cbd4fb2107af4] ext4: Fix spelling typo in the trace format for trace_ext4_da_writepages()
# Survived 46 runs
git bisect good 98a56ab382079f777e261e14512cbd4fb2107af4
# bad: [b6a758ec3af3ec236dbfdcf6a06b84ac8f94957e] ext4: move ext4_mb_init_group() function earlier in the mballoc.c
# Failed in 20 runs
git bisect bad b6a758ec3af3ec236dbfdcf6a06b84ac8f94957e
# bad: [daea696dbac0e33af3cfe304efbfb8d74e0effe6] ext4: Remove unneeded BUG_ON() in ext4_move_extents()
# Failed in 40 runs
git bisect bad daea696dbac0e33af3cfe304efbfb8d74e0effe6
# good: [a3710fd1ee8cd542c5de63cf2c39f8912031f867] ext4: fix tracepoint format string warnings
# Survived 59 runs
git bisect good a3710fd1ee8cd542c5de63cf2c39f8912031f867
# bad: [5f3481e9a80c240f169b36ea886e2325b9aeb745] ext4: fix cache flush in ext4_sync_file
# Failed in 40 runs
git bisect bad 5f3481e9a80c240f169b36ea886e2325b9aeb745
# bad: [d0646f7b636d067d715fab52a2ba9c6f0f46b0d7] ext4: Remove journal_checksum mount option and enable it by default
# Failed in 9 runs
git bisect bad d0646f7b636d067d715fab52a2ba9c6f0f46b0d7

So it's looking like a problem with journal checksums. I backed that patch out and am running the testcase overnight to convince myself that it survives without it. The patch itself seems fine - it just changes the default; I'm looking at the journal checksum code to see if I can see any problem.
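Given that bisection result, the obvious spot-checks are to back the suspect commit out of the tree under test, or to toggle the behavior via the mount option on a kernel where checksums are not the default. A rough sketch; the device and mount point are the ones from the reproducer script earlier in this report, adjust as needed:

# in the kernel tree under test: back out the default change and rebuild
git revert d0646f7b636d067d715fab52a2ba9c6f0f46b0d7

# on a kernel where checksums are not the default, opt in explicitly to see
# whether the corruption follows the journal_checksum mount option
mount -o journal_checksum /dev/sdb4 /mnt/test2

# after a crash/reboot cycle, look for checksum complaints
dmesg | grep -iE 'ext4|jbd2'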
My test overnight ran successfully through > 100 iterations of the test, on a tree checked out just prior to d0646f7b636d067d715fab52a2ba9c6f0f46b0d7. This morning I ran that same tree with the journal checksums enabled via mount option, saw that journal corruption was found by the checksumming code, and immediately after that we saw the corruption. So it is the checksum feature being on which is breaking this for us.

Linus, I would recommend reverting d0646f7b636d067d715fab52a2ba9c6f0f46b0d7 for now, at this late stage in the game, and those present on the ext4 call this morning agreed.

A few things seem to have gone wrong; for one, we should have at least issued a printk when we found a bad journal checksum, but we silently continued on thanks to a RDONLY check (and the root fs is mounted readonly...)

My hand-wavy hunch about what is happening is that we're finding a bad checksum on the last partially-written transaction, which is not surprising, but if we have a wrapped log and we're doing the initial scan for head/tail, and we abort scanning on that bad checksum, then we are essentially running an unrecovered filesystem. But that's hand-wavy and I need to go look at the code.

We lived without journal checksums on by default until now, and at this point they're doing more harm than good, so we should revert the default-changing commit until we can fix it and do some good power-fail testing with the fixes in place. I'll revert that patch and do another overnight test on an up-to-date tree to be sure nothing else snuck in, but this looks to me like the culprit, and I'm comfortable recommending that the commit be reverted for now.

Thanks,
-Eric

If the folks who saw this could revert d0646f7b636d067d715fab52a2ba9c6f0f46b0d7 (or just pull Linus' git tree, because he's reverted it upstream) and retest your scenarios, that'd be great.

-Eric

On Mon, 2 Nov 2009, Eric Sandeen <sandeen@redhat.com> wrote:
>
> if the folks who saw this could revert d0646f7b636d067d715fab52a2ba9c6f0f46b0d7
> (or just pull Linus' git tree, because he's reverted it upstream) and retest
> your scenarios that'd be great.

I've done a few dirty reboots on the laptop that was showing the problem (trying to intentionally blow things up by rebooting in the middle of a kernel compile), and have yet to see the kinds of problems I had originally. Instead of getting problems with "lookup()" finding deleted inodes, I get orphan cleanup during recovery, and the filesystem looks ok. I get truncated files in the compile (and need to remove the bad *.o files), but that's what I'd _expect_ from killing the machine in the middle of a compile.

So I think you nailed it with the revert.

Linus

Eric, many thanks for finding the root cause for this!!

I think we still need to apply Aneesh's proposed fix, but I suspect it's responsible for a very tiny percentage of the reported problems, since that requires pdflush racing against a truncate/unlink, which should be a very rare case.

It seems that the patch that got reverted was simply the one that flipped journal checksums back to a non-default mount option. In other words, I would still expect to see this if someone mounts ext4 with the journal checksum option. Not a blocker for now, but still something that we need to understand and fix.

There's a lot more we need to understand --- including why we weren't seeing a printk message indicating a journal checksum, and which commit was showing the checksum failure, and why we ended up seeing a checksum failure in the first place.
If it was the last commit that was being written right before the system crash, the commit block simply should have been missing due to the barrier. But we can do this without having to worry about 2.6.32 being a QA disaster for ext4. :-)

(In reply to comment #172)
> There's a lot more we need to understand --- including why we weren't seeing a
> printk message indicating a journal checksum, and which commit was showing the
> checksum failure, and why we ended up seeing a checksum failure in the first
> place. If it was the last commit that was being written right before the
> system crash, the commit block simply should have been missing due to the
> barrier. But we can do this without having to worry about 2.6.32 being a QA
> disaster for ext4. :-)

I am looking for the checksum problem today. I agree, seeing an otherwise-valid commit block w/ a bad checksum is very troubling - barrier infrastructure problems? Or something else, will have to see.

I know why we're not seeing a printk, already; it's under a RDONLY test, and the root fs is mounted readonly. I have a patch for that, though I wonder what the right choice is when we encounter this corruption - I'm leaning towards treating it like any other corruption and going through ext4_error...

...anyway, still looking at all this today, even w/ the .32 qa disaster averted for now. :)

Tidbit; after turning on checksums but disabling any action on a mismatch, and adding a little instrumentation, I got something like:

checksum error 1293235, (not) breaking
normal scan finish, setting end_transaction to 1293246
(fs/jbd2/recovery.c, 388): do_one_pass: Starting recovery pass 1
(fs/jbd2/recovery.c, 388): do_one_pass: Starting recovery pass 2
(fs/jbd2/recovery.c, 257): jbd2_journal_recover: JBD: recovery, exit status 0, recovered transactions 1293215 to 1293246

IOW, the checksum error was far from the last transaction in the log; it was somewhere in the middle-ish of what needed to be replayed. So this more or less explains the corruption (many transactions were not replayed because recovery was aborted when the bad checksum was found), and it doesn't necessarily point to a problem with the transaction in flight when the power-fail occurred... we had about 10 transactions after that. Hrm.

Double-hrm; if it's not the transaction that was right during the power-fail, maybe this is a hint/clue for the other bug that claims corruption after a -clean- shutdown. Probably time to go look for races, buffer re-use, etc in the checksumming code...?

> Double-hrm; if it's not the transaction that was right during the power-fail,
> maybe this is a hint/clue for the other bug that claims corruption after a
> -clean- shutdown. Probably time to go look for races, buffer re-use, etc in
> the checksumming code...?
I *hope* that is what it is. The scarier alternative is that we have
a race/buffer re-use or some other race that we've never noticed
because it's relatively rare enough, and if it's a block that ends up
getting overwritten by a subsequent transaction, it would be a
self-healing problem....
- Ted
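One way to investigate the question raised above (whether the bad checksum belongs to a block that a later transaction overwrote and "healed") is to dump the journal of the affected device before anything replays it. debugfs's logdump command can do this; the device name below is a placeholder and the exact options vary with the e2fsprogs version, so treat this as a pointer rather than a recipe.

# dump the journal of an unmounted (or ro,noload-mounted) device
debugfs -R "logdump -a" /dev/sdb4 > journal-dump.txt

# the dump lists each transaction's descriptor and commit blocks, which can
# be compared against the transaction IDs reported by the recovery code
grep -i transaction journal-dump.txt | head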
Created attachment 23685 [details]
Horrible horrible debugging hack
I started reading this because I was afraid the writeback changes were broken, but tracking down crc errors is really interesting as well. I started with the assumption someone was changing the buffer after we crc but before it hits disk, which in ext4 terms means that someone is probably changing data that has been sent to the log without properly diving into the jbd2 do_get_write_access().
Eric helped prove this is the case by testing a patch that always crcs and writes a stable duplicate of the metadata buffer instead of the metadata buffer itself (a one liner since the log code was already setup for this).
This attachment is a hack but the basic idea is to set the page for a buffer head readonly before we crc the page, and then set it read/write again when we're done with it. It will only work when the pagesize and the blocksize are the same.
It probably has some false positives lurking, but so far it has found a number of places where ext4 is changing a buffer without properly calling into the jbd2 do_get_write_access() func.
Eric hit a few easy ones:
ext4_statfs():
/* why is statfs changing the super? */
ext4_free_blocks_count_set(es, buf->f_bfree);
ext4_setxattr_set_handle()
memset(raw_inode, 0, EXT4_SB(inode->i_sb)->s_inode_size);
But I hit a much harder case to fix. ext4_symlink will log the data blocks that hold the symlink name (well ext4_write_begin will log them).
ext4 uses the generic symlink read helpers, so when we read the link we dive into page_getlink() which uses nd_terminate_link to stuff a NULL into the end of the link's name. Since the null goes directly into the page cache page and the page was logged, we're changing things that could affect the log crcs.
I think a more generic and less hacky form of this patch would be a really good thing...maybe it already exists and I just don't know about it yet ;)
On Fri, 6 Nov 2009, Chris Mason <chris.mason@oracle.com> wrote:
>
> But I hit a much harder case to fix. ext4_symlink will log the data blocks
> that hold the symlink name (well ext4_write_begin will log them).
>
> ext4 uses the generic symlink read helpers, so when we read the link we dive
> into page_getlink() which uses nd_terminate_link to stuff a NULL into the end
> of the link's name.

Gaah. We've had some issues with that before too.

Sadly, the name lookup routines _really_ want the terminating NUL at the end, so the alternatives are to (a) allocate scratch space and copy every time you follow a symlink or (b) do major surgery on the name lookup code to take a 'ptr,len' approach to names. Both seem nasty.

For ext4, one thing I would suggest is to simply always put the NUL at the end of the name, even if the filesystem might not need it. Then nd_terminate_link() will still write the NUL, but it won't change any data (and thus no CRC's).

That said, I haven't looked at how nasty it would be to change the name lookup to use <ptr,len>. Maybe it wouldn't be as horrible as I suspect it is (the filesystems themselves wouldn't need changing, since the lowlevel filesystem name lookup routines always sees a 'struct dentry' rather than the pathname).

Linus

(In reply to comment #178)
> For ext4, one thing I would suggest is to simply always put the NUL at the
> end of the name, even if the filesystem might not need it. Then
> nd_terminate_link() will still write the NUL, but it won't change any data
> (and thus no CRC's).

Ah, good idea in the short term at least.

FWIW I've got the other 2 places I hit fixed up and my testcase is surviving thus far. I'll find a testcase for the symlink, fix that up, and get some patches out.

Big kudos to Chris for the swanky patch to make this all obvious...

-Eric

On Fri, 6 Nov 2009, Eric Sandeen <sandeen@redhat.com> wrote:
>
> > For ext4, one thing I would suggest is to simply always put the NUL at the
> > end of the name, even if the filesystem might not need it. Then
> > nd_terminate_link() will still write the NUL, but it won't change any data
> > (and thus no CRC's).
>
> Ah, good idea in the short term at least.

Btw, if you do that, then for Chris' debug patch you might then make nd_terminate_link() do a conditional write ("only write NUL if it isn't already NUL"), and thus avoid the debug code triggering for that case.

Linus

Reply-To: hch@infradead.org

On Fri, Nov 06, 2009 at 10:07:15PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> That said, I haven't looked at how nasty it would be to change the name
> lookup to use <ptr,len>. Maybe it wouldn't be as horrible as I suspect it
> is (the filesystems themselves wouldn't need changing, since the lowlevel
> filesystem name lookup routines always sees a 'struct dentry' rather than
> the pathname).

FYI, there are other filesystems that historically store the name without a terminating NULL in the symlink and would benefit from that.

On Fri, Nov 06, 2009 at 10:07:15PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> For ext4, one thing I would suggest is to simply always put the NUL at the
> end of the name, even if the filesystem might not need it. Then
> nd_terminate_link() will still write the NUL, but it won't change any data
> (and thus no CRC's).
It's pretty simple to do this; in fact it's a one-line change in ext4_symlink() in fs/ext4/inode.c:

		/* clear the extent format for fast symlink */
		EXT4_I(inode)->i_flags &= ~EXT4_EXTENTS_FL;
		inode->i_op = &ext4_fast_symlink_inode_operations;
-		memcpy((char *)&EXT4_I(inode)->i_data, symname, l);
+		strcpy((char *)&EXT4_I(inode)->i_data, symname);
		inode->i_size = l-1;

- Ted

On Sun, 8 Nov 2009, bugzilla-daemon@bugzilla.kernel.org wrote:
>
> It's pretty simple to do this; in fact it's a one-line change in
> ext4_symlink() in fs/ext4/inode.c:
>
>		/* clear the extent format for fast symlink */
>		EXT4_I(inode)->i_flags &= ~EXT4_EXTENTS_FL;
>		inode->i_op = &ext4_fast_symlink_inode_operations;
> -		memcpy((char *)&EXT4_I(inode)->i_data, symname, l);
> +		strcpy((char *)&EXT4_I(inode)->i_data, symname);
>		inode->i_size = l-1;

I think that's wrong.

First off, it's slower. Secondly, 'l' is already 'strlen+1', so it looks like ext4 _already_ copies the NUL terminating character.

So I suspect the debugging patch complained about writing a NUL to a location that was already NUL to begin with.

Linus
> like ext4 _already_ copies the NUL terminating character.
>
> So I suspect the debugging patch complained about writing a NUL to a
> location that was already NUL to begin with.
Umm, yeah. Good point. Since the debugging patch marks the page ro,
short of doing a test before trying to null-terminate the symlink, the
debugging patch is always going to flag a false-positive here.
- Ted
Good point on the symlink code; we must be copying the null already.

There's another strange one I hit at mount time last week but forgot to include. It could be noise from earlier debugging runs of the patch that left read-only pages sitting around, but I think if I squint it really makes sense. During mounts and remounts, ext4 will mess around with the flags on the journal super. I don't see exactly where the buffer is logged or prepared... most of the callers of jbd2_journal_clear_features look like they need a bigger audit to me.

Does anyone know the perf cost of forcing the page ro/rw? I'm looking at integrating it into a CONFIG_DEBUG_CSUMS for btrfs. For any individual page it shouldn't happen too often, but we'd be hammering on it pretty good during a commit.

On Tuesday 17 November 2009, Theodore Tso wrote:
> On Mon, Nov 16, 2009 at 11:37:39PM +0100, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.31. Please verify if it still should be listed and let me know
> > (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14354
> > Subject : Bad corruption with 2.6.32-rc1 and upwards
> > Submitter : Holger Freyther <zecke@selfish.org>
> > Date : 2009-10-09 15:42 (39 days old)
>
> Um, this was marked as resolved, until you reopened it and then reset
> the state to New. Why did you do this?
>
> It's fixed in mainline as of commit d4da6c9 when Linus reverted commit
> d0646f7. Users could still see it if they mount a file system with -o
> journal_checksum, but (a) it's no longer the default, and (b)
> corruption if you use the non-default journal_checksum mount option is
> not a regression.
>
> We have fixes to make journal_checksum safe queued for 2.6.33, but the
> revert fixes the regression problem.
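Since the resolution hinges on one commit and its revert, a quick way to check which side of the fix a given tree or system is on; the commit IDs are the ones cited above, and the commands are only an illustration:

# in a kernel tree: both commits should be visible once the revert landed
git show -s d0646f7    # made journal checksums the default (the regression)
git show -s d4da6c9    # the revert of d0646f7 mentioned above

# on a running system: make sure nothing opts back in to the non-default
# journal_checksum mount option
grep journal_checksum /etc/fstab /proc/mounts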