In the file ./linux/fs/ext4/extents.c there is this function:

    int ext4_ext_truncate(handle_t *handle, struct inode *inode) {...}

It is being called constantly, from the moment the machine boots.

Steps to reproduce: inject a simple printk at the beginning of the function and check dmesg.

After some digging, it turns out the inode being passed to the function belongs to the file /var/log/journal/7c8e96117d45417e980f5fec3775d67f/system.journal. The caller is ext4_setattr(...) in ./linux/fs/ext4/inode.c, which contains the following code:

    /*
     * Call ext4_truncate() even if i_size didn't change to
     * truncate possible preallocated blocks.
     */
    if (attr->ia_size <= oldsize) {
            rc = ext4_truncate(inode);
            if (rc)
                    error = rc;
    }

So ext4_truncate() is being called on the system.journal file even though the file's size never actually changes. Although the comment justifies calling it when the size is unchanged, it is probably unintended that it runs at such a high frequency without doing anything. I'd like to ask for thoughts on this, and I'd be glad to fix it myself if possible.
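For reference, the debug change I used was roughly the following (reconstructed from memory, so treat the exact wording and placement as approximate); just a single printk at the top of ext4_ext_truncate():

    int ext4_ext_truncate(handle_t *handle, struct inode *inode)
    {
            /* temporary debug output; this is what shows up in dmesg */
            printk(KERN_DEBUG "ext4_ext_truncate: inode %lu\n", inode->i_ino);
            ...
    }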
Clarification: when I say the function is being "called all the time", I mean it runs in rapid succession with nothing visible triggering it, purely in the background. The problem is that it consumes some CPU.
It's not true that "nothing is triggering it". What is triggering it is systemd-journald, which is apparently calling ftruncate(2) on its log file every single time it writes to the binary log file. I'm not sure why journald calls ftruncate(2) to the current size of the file, but this causes a call to ext4_setattr(), which is what calls ext4_truncate().

So when you added the printk in the codepath of ext4_truncate(), that caused a kernel log message to be emitted, which triggered journald, which then wrote into the circular log file /var/log/journal/XXXXXXXXX/system.journal. Why it repeatedly calls ftruncate(2) in what is apparently a no-op is unclear to me, but that write triggers another kernel log message, which triggers journald again, in a repeated loop. So it was the very act of adding the printk which caused it to be printed "all the time". Under normal circumstances, this shouldn't be much of an issue.

Now, it is true that we could optimize ftruncate(2) for workloads that call truncate very often. Calling ext4_truncate() does involve starting a journal handle, and adding and removing the inode from the orphan list. We just didn't think there would be programs calling ftruncate(2) in a tight loop with arguments that make it effectively a no-op. It would be possible to add a flag to the in-memory inode recording that it has already been truncated, so that when the ftruncate(2) is a no-op we could skip calling ext4_truncate(). If fallocate is called on the inode with FALLOC_FL_KEEP_SIZE, we would clear that in-memory flag. (See the sketch below.)

Another way to address this, if you really want to track how often ext4_truncate() gets called without triggering the journald doom loop, would be to use ftrace and/or tracepoints instead of printk. This is much more efficient, since the log is written into an in-kernel circular buffer and is only later expanded from binary form into human-readable text, without having to invoke systemd's journald.

And finally, someone could reach out to the systemd developers and ask why they are calling ftruncate(2) in many cases where it appears to be completely unnecessary.
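To make the flag idea concrete, here is a rough sketch of what the check in ext4_setattr() could look like. This is only an illustration, not a tested patch: the state bit EXT4_STATE_NO_PREALLOC is a made-up name, while the ext4_{test,set,clear}_inode_state() helpers are the existing ones from fs/ext4/ext4.h.

    /*
     * Sketch only: assumes a new in-memory state bit, here called
     * EXT4_STATE_NO_PREALLOC (hypothetical name), set once a truncate
     * has already cleaned up any preallocated blocks, and cleared when
     * fallocate(FALLOC_FL_KEEP_SIZE) preallocates blocks again.
     */
    if (attr->ia_size <= oldsize &&
        !ext4_test_inode_state(inode, EXT4_STATE_NO_PREALLOC)) {
            rc = ext4_truncate(inode);
            if (rc)
                    error = rc;
            else
                    ext4_set_inode_state(inode, EXT4_STATE_NO_PREALLOC);
    }

    /* ...and in the fallocate(FALLOC_FL_KEEP_SIZE) path, after new blocks
     * have been preallocated past i_size:
     */
    ext4_clear_inode_state(inode, EXT4_STATE_NO_PREALLOC);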
You are right, the printk was causing the problem; I figured it out using ftrace. However, I can still see how this problem could arise once some (necessary) printk is emitted along the call stack of ftruncate(2). Just as an example, ext4_truncate() contains this code:

    if (!(inode->i_state & (I_NEW | I_FREEING)))
            WARN_ON(!inode_is_locked(inode));

WARN_ON is essentially a printk. While it won't be called most of the time, once it is, the discussed problem will appear again (at least until the assertion stops being true). I will try to fix this.
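One mitigation I'm considering (just my own idea at this point, not anything already in the tree) would be to bound the loop by warning only once, along these lines:

    /* Possible mitigation (sketch): WARN_ON_ONCE() still reports the
     * inconsistency, but emits the warning a single time, so a write to
     * journald's own log file cannot keep re-triggering it on every
     * subsequent truncate.
     */
    if (!(inode->i_state & (I_NEW | I_FREEING)))
            WARN_ON_ONCE(!inode_is_locked(inode));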
"Won't be called most of the time" is rather understating the case. The WARN_ON should never, *EVER* fire. If it does we have a kernel bug, and if the user has panic_on_warn set, the kernel will crash. Also, please note that the WARN_ON would triggers when we have a inode-specific inconsistency. In the extremely unlikely case that the WARN_ON triggers, the doom loop that you are worried about would only happen if (a) panic_on_warn is disabled, and (b) the inconsistent data state happens to be the journald's log file. If the file system has millions of inodes, we're talking about a probability of one in millions, even if the kernel is buggy and triggers at all.