I saw an ext4 "orphan list check failed" error in my logs. It is happening on a new 1TB hard disk, mostly when idle. I ran a SMART check and an fsck check for bad sectors from the Arch Linux rescue USB; neither found anything. I reformatted the disk and restored a backup, then decided to just get a replacement instead of waiting for this to happen again. The same issue happened 50 minutes ago.

Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum

The filesystem was formatted under kernel 4.8 with the latest e2fsprogs. Checking the inode that caused the hiccup reveals it points to a folder in /usr/share/ that has not been touched since the 25th of last month, and I have fscked and rebooted plenty of times since then (I fsck on each reboot anyway). I also ran fsck -D -fv /dev/mapper/root from the rescue USB after the last reboot.

Could the metadata_csum option be causing this? The kernel.org ext4 wiki suggests it is an integrity measure. My backups don't include system journals (to save space on backup media), so I don't have the full kernel crash log. I am sure there was no I/O error message when I typed dmesg, and I was told in Arch Linux support channels that this rules out hardware errors. I believe this is the second time this has happened.
We really badly need the logs to understand more of what's going on. This error doesn't represent an on-disk corruption, but rather an inconsistency in an in-memory data structure. As such, reformatting and reloading from backups wasn't necessary, and it's not surprising e2fsck didn't find anything.

What the error means is that at the time the kernel tried to release an inode from memory, it was apparently on the orphan linked list. This is a "should never happen" situation, and indicates either a kernel bug in ext4, a hardware-induced memory bit-flip, or a kernel bug somewhere else that involved a wild pointer dereference corrupting the data structure in question. This is why we really need the logs to see what might have happened; the dump of the data structure is critical here.

It doesn't make sense that the inode "points to a folder in /usr/share". Do you mean that the inode literally corresponds to a directory? Directories are never on the orphan list; only regular files, and only while they are being deleted or truncated, or in a few other specialized circumstances. Once the deletion or truncation is completed, they are removed from the on-disk orphan list as well as the in-memory linked list. So something is really wrong, and the logs would be very helpful in figuring out what might be going on.

The good news is your data shouldn't be at risk. If you can come up with a solid reproduction case, that would be especially helpful, particularly if we can reproduce it on another machine.
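Since your backups exclude the system journal, one way to make sure the next crash log survives a reboot (a side suggestion, assuming systemd-journald as used on Arch) is to enable persistent journal storage; the previous boot's kernel messages can then be pulled with `journalctl -k -b -1`:

```ini
; /etc/systemd/journald.conf -- keep logs across reboots
; (then: systemctl restart systemd-journald; after the next crash,
;  retrieve the kernel log of the previous boot with: journalctl -k -b -1)
[Journal]
Storage=persistent
```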
I will post a backtrace of the crash if it happens again. I had written down the output of the inode check on paper:

debugfs -R 'ncheck 54264316' /dev/dm-0
debugfs 1.43.3 (04-Sep-2016)
Inode	Pathname
54264316	/usr/share/help/oc/totem/figures

It should be more or less correct since I copied the whole thing by hand. The folder itself exists. The only thing I remember which may have touched this folder since it was last modified on the 25th of October is that I rebooted into the rescue USB once and ran e4defrag, but I do not even remember the date when I did so.

tune2fs -l /dev/mapper/root says:

Mount count:              1
Maximum mount count:      1

My hard disks are all fscked at reboots. I will monitor whether this happens again and post a full log. Thank you for the instructions.
So /usr/share/help/oc/totem/figures is a *directory*? A directory has no business being on the orphan list, so something very strange is indeed going on. If this happens again, please use debugfs to get a stat of the inode in question:

debugfs /dev/dm-0
debugfs:  stat <54264316>

The timestamps in the inode might be helpful.
- Yes, a directory. This is partially what scared me. - Ok, thank you. I will do so if it happens again.
If it happens again, my suggestion would be to capture as much information as you can, and then just reboot the system. You can fsck the file system if you are feeling especially paranoid, but it's likely that any corruption in the data structures is in-memory only. That being said, the in-memory corruption could cause other corruptions that could eventually damage data or metadata, so rebooting right away would be a good idea.
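The capture steps discussed so far (dmesg, debugfs stat, debugfs ncheck) could be scripted ahead of time so nothing is forgotten before the reboot. This is only a sketch: the function name and output paths are made up, and the device and inode number must come from the actual kernel message. The demo at the bottom runs against a throwaway image file so it can be tried without root.

```shell
#!/bin/sh
# Sketch of a pre-reboot capture helper (names and paths are illustrative).
capture_orphan_report() {
    dev=$1; inode=$2; out=$3
    mkdir -p "$out"
    dmesg > "$out/dmesg.txt" 2>/dev/null || true                # full trace, incl. the struct dump
    debugfs -R "stat <$inode>" "$dev" > "$out/stat.txt" 2>&1    # timestamps, link count
    debugfs -R "ncheck $inode" "$dev" > "$out/ncheck.txt" 2>&1  # inode -> pathname
}

# Demo on a scratch ext4 image. On the real system it would be e.g.:
#   capture_orphan_report /dev/dm-0 54264316 /root/orphan-report
img=$(mktemp /tmp/orphan.XXXXXX)
dd if=/dev/zero of="$img" bs=1M count=8 status=none
mkfs.ext4 -q "$img"
capture_orphan_report "$img" 2 /tmp/orphan-report   # inode 2 = root directory
grep 'Type: directory' /tmp/orphan-report/stat.txt
rm -f "$img"
```

Running debugfs read-only against an unmounted device (or an image, as here) is safe; the point is simply to have everything saved to disk before rebooting.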
(In reply to Theodore Tso from comment #3)
> So /usr/share/help/oc/totem/figures is a *directory*? A directory has no
> business being on the orphan list, so something very strange is indeed going
> on.

It *is* possible to have a directory on the orphan list, since it is possible to have an open-unlinked directory if one cd's into the directory before it is unlinked (note nlink = 0, and dtime holds an inode number, not a time):

$ mkdir foo
$ cd foo
$ rmdir ../foo
$ ls -lid .
289432 drwxrwxr-x 0 mythtv mythtv 0 Nov  7 13:23 .
$ sudo debugfs -c -R "stat <289432>" /dev/vg_basemyth/lvroot.basemyth
debugfs 1.42.7.wc2 (07-Nov-2013)
Inode: 289432   Type: directory    Mode:  0775   Flags: 0x0
Generation: 2752543570    Version: 0x00000000:00000002
User:   500   Group:   500   Size: 4096
File ACL: 0    Directory ACL: 0
Links: 0   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5820e2b5:4aa59b94 -- Mon Nov  7 13:23:17 2016
 atime: 0x5820e2b0:0048b0c8 -- Mon Nov  7 13:23:12 2016
 mtime: 0x5820e2b0:0048b0c8 -- Mon Nov  7 13:23:12 2016
crtime: 0x5820e2b0:0048b0c8 -- Mon Nov  7 13:23:12 2016
 dtime: 0x0003f42b -- Sat Jan  3 16:58:35 1970
Size of extra inode fields: 28
BLOCKS:
(0):1258994
Well, yes, but in this case the directory hasn't been deleted, as I understand the original bug report. And you can't truncate a directory....
Hi. It just happened again.

[root@hades hussam]# debugfs /dev/dm-0
debugfs 1.43.3 (04-Sep-2016)
debugfs:  stat <58073259>
Inode: 58073259   Type: directory    Mode:  0755   Flags: 0x80000
Generation: 1123105798    Version: 0x00000000:0000003e
User:     0   Group:     0   Project:     0   Size: 4096
File ACL: 0    Directory ACL: 0
Links: 2   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x58709c64:841efcb0 -- Sat Jan  7 09:44:36 2017
 atime: 0x588bc31e:c21e1c40 -- Sat Jan 28 00:01:02 2017
 mtime: 0x58709c4e:00000000 -- Sat Jan  7 09:44:14 2017
crtime: 0x58709c64:6127674c -- Sat Jan  7 09:44:36 2017
Size of extra inode fields: 32
Inode checksum: 0xfb5c96a4
EXTENTS:
(0):232268326
debugfs:  quit
[root@hades hussam]# debugfs -R 'ncheck 58073259' /dev/dm-0
debugfs 1.43.3 (04-Sep-2016)
Inode	Pathname
58073259	/usr/share/swig/3.0.11/scilab
Created attachment 253371 [details]
logged trace

This is the crash data according to journalctl.
... and in this case the directory has nlinks > 0, so it definitely shouldn't have been on the orphan list at all. Can you tell me anything about the workload of your system? And given the mod times, it looks like it hasn't been modified any time recently, either. This is really starting to smell like some kind of in-memory corruption, caused perhaps by a hardware fault inducing a bit flip, or some random bug in another part of the kernel. What has changed recently? Were you using an older kernel before? What was the last kernel version that was problem-free for you? Have you tried running Memtest86+ to make sure you don't have any memory issues?
At the time, the system was under very heavy load, as I was compiling LibreOffice and it was heavily swapping. I recently upgraded from Linux 4.8.13 to 4.9.5 and then to 4.9.6. I will run a Memtest86+ test to check whether the issue is memory related. Thank you.
One more note: I definitely never had this issue before Linux 4.8.
Created attachment 253481 [details] memtest at 65%
Created attachment 253491 [details] memtest at 78%
Created attachment 253501 [details] memtest at 99%
Created attachment 253511 [details]
memtest almost complete

Hi again. I did a memtest, though only one pass since I need to type some documents. I think it finished without errors.
So you're seeing it on both 4.9 and 4.8, correct? How reliably can you reproduce the failure? It's possible that turning off metadata_csum might make a difference, although that would be surprising, since the expected failure modes would involve checksum failures, and that's not what you're seeing.
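For reference, a minimal sketch of what turning the feature off involves, run against a scratch image file rather than a real device (this assumes e2fsprogs 1.43+, where tune2fs can clear metadata_csum on an unmounted, clean filesystem; whether doing so would actually help here is exactly what is in question):

```shell
# Build a small ext4 image with metadata_csum, then clear the feature
# with tune2fs. Mechanics demo only, on a throwaway image.
img=$(mktemp /tmp/csum.XXXXXX)
dd if=/dev/zero of="$img" bs=1M count=16 status=none
mkfs.ext4 -q -O metadata_csum "$img"
before=$(dumpe2fs -h "$img" 2>/dev/null | grep '^Filesystem features')
echo "$before"
tune2fs -O '^metadata_csum' "$img" >/dev/null 2>&1   # filesystem must be unmounted
after=$(dumpe2fs -h "$img" 2>/dev/null | grep '^Filesystem features')
echo "$after"
rm -f "$img"
```

On a real root filesystem this would have to be done from the rescue USB, with a fresh `e2fsck -f` first.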
No, it was only happening once a month or so. I upgraded today to a Core i5 with 8GB of RAM. I just plugged in my old hard disks and rebuilt the initramfs image. Perhaps the old hardware was too weak for the load I was putting on the computer. I'll wait and see; hopefully the issue is gone.
I haven't seen this again on the new computer. I'm closing this as invalid.
I just saw a similar error on a 4.13.3 kernel. In my case the inode in question was a file. It's the first time I have seen this error, and I haven't changed anything on the machine for a while (aside from updating the kernel a few weeks ago).

uname -a
Linux htpc 4.13.3-gentoo #2 SMP PREEMPT Tue Sep 26 14:10:32 CEST 2017 x86_64 AMD Athlon 5370 APU with Radeon R3 AuthenticAMD GNU/Linux

dmesg:
[280833.904264] EXT4-fs (sda1): Inode 154567855 (ffff880100005f80): orphan list check failed!
[280833.904276] ffff880100005f80: 0001f30a 00000004 00000000 00000000 ................
[280833.904278] ffff880100005f90: 00000001 24d8e41a 00000000 00000000 .......$........
[280833.904280] ffff880100005fa0: 00000000 00000000 00000000 00000000 ................
[280833.904282] ffff880100005fb0: 00000000 00000000 00000000 00000000 ................
[280833.904284] ffff880100005fc0: 00000000 00000000 000049b4 00000000 .........I......
[280833.904286] ffff880100005fd0: 00080000 00000080 00000000 00000000 ................
[280833.904288] ffff880100005fe0: 00005fe0 ffff8801 00005fe0 ffff8801 ._......._......
[280833.904290] ffff880100005ff0: 00000000 00000000 00000000 00000000 ................
[280833.904292] ffff880100006000: 004f6d2c ffff8801 00006000 ffff8801 ,mO......`......
[280833.904294] ffff880100006010: 000009a8 00000000 00000000 00000000 ................
[280833.904296] ffff880100006020: 00006020 ffff8801 00006020 ffff8801  `...... `......
[280833.904297] ffff880100006030: 00000000 00000000 00000000 00000000 ................
[280833.904299] ffff880100006040: 00000000 00000000 00006048 ffff8801 ........H`......
[280833.904301] ffff880100006050: 00006048 ffff8801 00000000 00000000 H`..............
[280833.904303] ffff880100006060: 00000000 00000000 000d81a4 000000fa ................
[280833.904305] ffff880100006070: 000000fa 00001000 ffffffff ffffffff ................
[280833.904306] ffff880100006080: ffffffff ffffffff 81617100 ffffffff .........qa.....
[280833.904308] ffff880100006090: 39c42000 ffff8801 000061c8 ffff8801 . .9.....a......
[280833.904310] ffff8801000060a0: 093684af 00000000 00000001 00000000 ..6.............
[280833.904312] ffff8801000060b0: 000009a8 00000000 59d0f5d4 00000000 ...........Y....
[280833.904314] ffff8801000060c0: 1e1a23d3 00000000 59cf614a 00000000 .#......Ja.Y....
[280833.904315] ffff8801000060d0: 00000000 00000000 59d0f5d4 00000000 ...........Y....
[280833.904317] ffff8801000060e0: 1e1a23d3 00000000 00000000 00000000 .#..............
[280833.904353] ffff8801000060f0: 0000000c 00000000 00000008 00000000 ................
[280833.904355] ffff880100006100: 00000060 00000000 00000000 00000000 `...............
[280833.904357] ffff880100006110: 00006110 ffff8801 00006110 ffff8801 .a.......a......
[280833.904359] ffff880100006120: 00000000 00000000 00000000 00000000 ................
[280833.904360] ffff880100006130: 0569d91c 00000001 00000000 00000000 ..i.............
[280833.904363] ffff880100006140: 00000000 00000000 00000000 00000000 ................
[280833.904364] ffff880100006150: 00006150 ffff8801 00006150 ffff8801 Pa......Pa......
[280833.904366] ffff880100006160: 00006160 ffff8801 00006160 ffff8801 `a......`a......
[280833.904368] ffff880100006170: 00006170 ffff8801 00006170 ffff8801 pa......pa......
[280833.904370] ffff880100006180: 00006180 ffff8801 00006180 ffff8801 .a.......a......
[280833.904372] ffff880100006190: 00000000 00000000 811a0610 ffffffff ................
[280833.904373] ffff8801000061a0: 00000001 00000000 00000000 00000000 ................
[280833.904375] ffff8801000061b0: 00000000 00000000 816171c0 ffffffff .........qa.....
[280833.904377] ffff8801000061c0: 00000000 00000000 00006068 ffff8801 ........h`......
[280833.904379] ffff8801000061d0: 01180020 00000000 00000000 00000000  ...............
[280833.904380] ffff8801000061e0: 00000000 00000000 00000000 00000000 ................
[280833.904382] ffff8801000061f0: 00000000 00000000 000061f8 ffff8801 .........a......
[280833.904384] ffff880100006200: 000061f8 ffff8801 00000000 00000000 .a..............
[280833.904385] ffff880100006210: 00000000 00000000 00000000 00000000 ................
[280833.904387] ffff880100006220: 00000000 00000000 00000001 00000000 ................
[280833.904389] ffff880100006230: 81617ac0 ffffffff 00000010 00000000 .za.............
[280833.904391] ffff880100006240: 00000000 014200ca 00006248 ffff8801 ......B.Hb......
[280833.904392] ffff880100006250: 00006248 ffff8801 00000000 00000000 Hb..............
[280833.904394] ffff880100006260: 00000000 00000000 00006268 ffff8801 ........hb......
[280833.904396] ffff880100006270: 00006268 ffff8801 00000000 00000000 hb..............
[280833.904398] ffff880100006280: 36640dc5 00000000 00000000 00000000 ..d6............
[280833.904399] ffff880100006290: 00000000 00000000 00000000 00000000 ................
[280833.904401] ffff8801000062a0: 00000000 00000000 59d0f5d4 00000000 ...........Y....
[280833.904403] ffff8801000062b0: 1e1a23d3 00000000 000062b8 ffff8801 .#.......b......
[280833.904405] ffff8801000062c0: 000062b8 ffff8801 00000000 00000000 .b..............
[280833.904407] ffff8801000062d0: 00000000 00000000 00000000 00000000 ................
[280833.904409] ffff8801000062e0: 00000000 00000000 000062e8 ffff8801 .........b......
[280833.904411] ffff8801000062f0: 000062e8 ffff8801 00000000 00000000 .b..............
[280833.904412] ffff880100006300: 00000000 ffffffff 00000000 00000000 ................
[280833.904414] ffff880100006310: 00000000 00000020 00000000 00000000 .... ...........
[280833.904416] ffff880100006320: 00006320 ffff8801 00006320 ffff8801  c...... c......
[280833.904418] ffff880100006330: ffffffe0 0000000f 00006338 ffff8801 ........8c......
[280833.904419] ffff880100006340: 00006338 ffff8801 8119bbe0 ffffffff 8c..............
[280833.904421] ffff880100006350: 00000000 00000000 00213ce0 00213ce0 .........<!..<!.
[280833.904423] ffff880100006360: 00000000 00000000 ........
[280833.904429] CPU: 1 PID: 211 Comm: khugepaged Not tainted 4.13.3-gentoo #2
[280833.904430] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AM1H-ITX, BIOS P1.50 04/29/2015
[280833.904432] Call Trace:
[280833.904445]  ? dump_stack+0x46/0x5e
[280833.904450]  ? ext4_destroy_inode+0x81/0x88
[280833.904454]  ? dispose_list+0x30/0x40
[280833.904456]  ? prune_icache_sb+0x48/0x58
[280833.904460]  ? super_cache_scan+0x133/0x180
[280833.904466]  ? shrink_slab.part.8.constprop.22+0x19e/0x250
[280833.904468]  ? shrink_node+0x58/0x190
[280833.904470]  ? try_to_free_pages+0x15a/0x320
[280833.904475]  ? __alloc_pages_nodemask+0x3db/0xaf0
[280833.904481]  ? collect_expired_timers+0xb0/0xb0
[280833.904485]  ? khugepaged+0x256/0x1590
[280833.904491]  ? do_wait_intr+0x80/0x80
[280833.904496]  ? kthread+0xf2/0x128
[280833.904498]  ? collapse_shmem.isra.4+0xa20/0xa20
[280833.904500]  ? kthread_create_on_node+0x40/0x40
[280833.904504]  ? call_usermodehelper_exec_async+0x134/0x138
[280833.904507]  ? ret_from_fork+0x22/0x30

debugfs:  stat <154567855>
Inode: 154567855   Type: regular    Mode:  0644   Flags: 0x80000
Generation: 912526789    Version: 0x00000000:00000001
User:   250   Group:   250   Project:     0   Size: 2472
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x59d0f5d4:78688f4c -- Sun Oct  1 16:04:04 2017
 atime: 0x59d0f5d4:78688f4c -- Sun Oct  1 16:04:04 2017
 mtime: 0x59cf614a:00000000 -- Sat Sep 30 11:18:02 2017
crtime: 0x59d0f5d4:78688f4c -- Sun Oct  1 16:04:04 2017
Size of extra inode fields: 32
EXTENTS:
(0):618193946

debugfs -R 'ncheck 154567855' /dev/sda1
debugfs 1.43.3 (04-Sep-2016)
Inode	Pathname
154567855	/usr/portage/metadata/md5-cache/dev-ruby/faraday_middleware-multi_json-0.0.6-r1
Can you try to see if you can reliably reproduce the problem? And if so, can you see whether you can reproduce it on other hardware? Note that one of the reasons why it's unfortunate when people reopen bugs, especially when the original reporter has concluded it was probably caused by a hardware issue in his case, is that it can sometimes be hard to disambiguate between data from the first report and data from successive reports. (This is why I hate it when Ubuntu users, in particular, find reports by googling for a symptom, assume that anything with the same symptom must be related, and all pile onto the same Launchpad bug. It's also why I believe Launchpad is mostly hopeless for a developer....)
I can't reproduce it. So far it was only a single occurrence, and I don't even know what caused it. It also happened during compilation, but I didn't notice any problem at the time, only the message in the log. I didn't want to reopen the bug lightly, and I'm well aware that it is too transient to be debugged in a systematic way. In light of this, I didn't want to open a new bug either; that would be mostly pointless. However, since this is something that "should never happen", another example of it might (or might not ;) ) shed some light on the issue.