Created attachment 256861 [details]
Example backtrace and log of freezing dockerd

Since kernel 4.10, there has been a noticeable increase in processes becoming zombies, as well as processes being force-killed after they freeze. A common denominator is that F2FS is mentioned in the journal (via journalctl from systemd) in connection with these issues.

This appears to happen completely at random. Sometimes a process crashes (these are processes that also crashed before this issue existed) and GNOME shows its "This application is unresponsive, would you like to kill it?" dialog. Since this issue started appearing, killing the application through that dialog no longer actually kills it: the window remains stuck and the application becomes a zombie instead. Additionally, other applications that do not have interfaces are freezing and being terminated at random.

The log traces are always similar to the attached log, mentioning the schedule call and the F2FS flush, varying only in the application being killed (dockerd, systemd-journal, pool, firefox, dconf-service, ...).

This started appearing in kernel 4.10 and is still a problem in 4.11. Downgrading to kernel 4.9 seems to solve the issue.

I'm reposting this issue here after it was posted on the Arch Linux bug tracker [1], as well as its forums [2].

[1] https://bugs.archlinux.org/task/53663
[2] https://bbs.archlinux.org/viewtopic.php?pid=1715404
For me it's 100% reproducible when running `npm install` with the following `package.json`:

```
{
  "name": "foo",
  "version": "0.0.1",
  "description": "",
  "author": "bar",
  "license": "MIT",
  "dependencies": {
    "bulma": "^0.4.1",
    "js-cookie": "^2.1.3",
    "lodash": "^4.17.4",
    "vue": "^2.2.1",
    "vue-i18n": "^6.0.0-alpha.6",
    "vue-multiselect": "^2.0.0-beta.14",
    "vue-nprogress": "^0.1.5",
    "vue-resource": "^1.2.1",
    "vue-router": "^2.3.0",
    "vue-shortkey": "^2.1.0"
  },
  "devDependencies": {
    "clean-webpack-plugin": "^0.1.16",
    "css-loader": "^0.26.1",
    "extract-text-webpack-plugin": "^2.0.0-rc.3",
    "node-sass": "^4.5.0",
    "sass-loader": "^6.0.2",
    "vue-loader": "^11.1.3",
    "vue-template-compiler": "^2.2.1",
    "webpack": "^2.2.1",
    "webpack-bundle-tracker": "^0.2.0",
    "webpack-merge": "^4.1.0"
  },
  "jshintConfig": {
    "esversion": 6,
    "strict": "global",
    "asi": true,
    "browser": true,
    "browserify": true,
    "jquery": false
  }
}
```

The npm process gets stuck forever.
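For reference, a rough sketch of the reproduction steps described above (the directory name below is just an example, not from the original report):

```
# create an empty directory and place the package.json shown above in it
mkdir npm-f2fs-repro && cd npm-f2fs-repro
# save the package.json above as ./package.json, then run:
npm install
# on an affected kernel the npm process reportedly hangs indefinitely
# and cannot be killed (it ends up as a zombie / stuck task)
```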
Dear all,

The same issue occurs with ceph (mon directory on F2FS) with F2FS used for the root partition.

Best regards,
Tobias
Arch just shipped 4.12.3. I didn't experience issues at first, but then tried to trigger the problem artificially by starting and stopping docker a few times, and lo and behold: it resurfaced. I've switched back to 4.9 LTS again, as it has been running without problems for two months now.

My problems seem to be mostly limited to annoying application freezes and zombie processes, but a friend's system became completely corrupted after some time, with his logs showing the issue multiple times (he didn't know he had the issue before).
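In case it helps anyone else trying to reproduce this, what I did was roughly the following (the iteration count and container image here are arbitrary placeholders, not exact commands from my runs; any I/O-heavy container work should do):

```
# generate repeated flush traffic by cycling the docker daemon and a container
for i in $(seq 1 10); do
    sudo systemctl start docker
    sudo docker run --rm hello-world   # image choice is arbitrary
    sudo systemctl stop docker
done

# in another terminal, watch the kernel log for the F2FS flush / hung task traces
journalctl -kf | grep -iE 'f2fs|blocked for more than'
```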
Helllloooooooooooooo f2fs devs? How can we help you tackle this issue?
Sorry for the late reply. :(

I suspect it is a bug in the flush_merge feature. In the recently released 4.14-rc1 kernel we have fixed some potential issues with this feature that could cause userspace apps to hang occasionally, so I'd suggest trying the latest f2fs code in that released kernel to see whether we have fixed the issue.

Also, could you track this issue in this thread on the f2fs mailing list:

https://sourceforge.net/p/linux-f2fs/mailman/message/36037901/
Thanks for your reply! I will try switching to the currently stable Linux 4.13 in Arch, activating the noflush_merge option, and seeing if the issue still appears. I've been trying to resurface the issue by going wild with docker a bit, like before, but this didn't trigger it anymore.

I've been using the noflush_merge option for about a day now and so far the issue has not occurred again. This seems to indicate that the issue is indeed in the flush merging functionality. I'll try to stay on 4.13 with noflush_merge for a while and see if anything bad happens. If not, at least I have a way to use the more recent kernels with F2FS :-).
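For anyone else who wants to try the same workaround, this is roughly how the option can be enabled (the device name and mount point below are placeholders for your own setup, and I haven't verified that a live remount is sufficient; the fstab entry plus a reboot is the safe route):

```
# /etc/fstab: f2fs root with flush merging disabled
# (replace /dev/sdXn and the mount point with your actual setup)
/dev/sdXn  /  f2fs  defaults,noflush_merge  0  1

# or attempt a live remount instead of rebooting:
sudo mount -o remount,noflush_merge /
```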
The issue occurs rarely. I believe that we need **more testing** before declaring it fixed. @mwohah, should we publish testing instructions on the Arch Linux forum to get more testers? However, I am not sure what exactly I need to do to test the fix.
@me I agree it needs more testing, though some use cases seem to expose the issue fairly often, such as npm and docker, possibly due to the amount of I/O involved.

We could post it on the Arch forums, though I already made a topic there (see the OP) as well as on the bug tracker, both of which link to this ticket, since it is a kernel issue. However, I do agree it could use a bit more visibility (perhaps on the Arch wiki page on F2FS?), so that users who are installing or want to install F2FS now know that this issue exists, since it can cause instability and, provided that noflush_merge fixes it, the workaround is rather trivial.
@Chao Yu I updated to 4.14 a few days ago and switched from noflush_merge back to flush_merge two days ago. So far, I haven't encountered any problems. Here's hoping the problem is solved! Should the problem return, I'll post something here.

I wanted to remind others of the fixes in 4.14, as their experience may differ. Thanks!
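For completeness, this is how I double-checked which mode is active after switching back (just reading the mount options and kernel version; as far as I can tell, flush_merge shows up in the option list only when it is enabled):

```
# show the effective f2fs mount options; flush_merge should appear when enabled
grep f2fs /proc/mounts
# confirm the running kernel actually includes the 4.14 fixes
uname -r
```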
@mwohah, Thanks for your test and feedback. :)
Hey, sorry to bother you two years later, but I am considering switching to F2FS again and I wonder whether the lack of further activity in this bug is because the problem got fixed or because everyone affected migrated away from F2FS (I certainly did).
(In reply to Szczepan from comment #11)
> Hey, sorry to bother you two years later, but I am considering switching to
> F2FS again and I wonder whether the lack of further activity in this bug is
> because the problem got fixed or because everyone affected migrated away
> from F2FS (I certainly did).

I am still using F2FS.
(In reply to Szczepan from comment #11)
> Hey, sorry to bother you two years later, but I am considering switching to
> F2FS again and I wonder whether the lack of further activity in this bug is
> because the problem got fixed or because everyone affected migrated away
> from F2FS (I certainly did).

I think it's worth giving f2fs another try, as we have added lots of features and made the code more stable over the last two years. Also, I know of users running f2fs from kernel v5.6-rc1 as their root partition filesystem; apart from one task hang issue that we have fixed in our git tree, I haven't received any further bug reports.
(In reply to me from comment #12)
> (In reply to Szczepan from comment #11)
> > Hey, sorry to bother you two years later, but I am considering switching
> > to F2FS again and I wonder whether the lack of further activity in this
> > bug is because the problem got fixed or because everyone affected migrated
> > away from F2FS (I certainly did).
>
> I am still using F2FS.

Cool, thanks for the trust.