Bug 195983 - [f2fs] zombie processes and freezing application related to schedule and f2fs_issue_flush
Status: ASSIGNED
Alias: None
Product: File System
Classification: Unclassified
Component: Other
Hardware: x86-64 Linux
Importance: P1 high
Assignee: fs_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-06-04 19:17 UTC by mwohah
Modified: 2020-03-27 08:02 UTC
CC List: 6 users

See Also:
Kernel Version: 4.12.3
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Example backtrace and log of freezing dockerd (1.28 KB, text/plain)
2017-06-04 19:17 UTC, mwohah

Description mwohah 2017-06-04 19:17:34 UTC
Created attachment 256861 [details]
Example backtrace and log of freezing dockerd

Since kernel 4.10, there has been a noticeable increase in processes becoming zombies, as well as processes freezing up and being force-killed. A common denominator for these issues seems to be that F2FS is mentioned in the journal (via systemd's journalctl) around the time they occur.

This appears to happen completely at random. Before this issue started, processes that crashed could be killed through GNOME's "This application is unresponsive, would you like to kill it?" dialog. Since the issue appeared, killing the application does not actually kill it: the window remains stuck and the application becomes a zombie instead.

Additionally, applications without a user interface are freezing and being terminated at random. The log traces always resemble the attached log, mentioning the schedule call and the F2FS flush; only the application being killed varies (dockerd, systemd-journal, pool, firefox, dconf-service, ...).

This started appearing in kernel 4.10 and is still a problem in 4.11. Downgrading to kernel 4.9 seems to solve the issue.

I'm reposting this issue here after it was posted on the Arch Linux bugtracker [1], as well as its forums [2].

[1] https://bugs.archlinux.org/task/53663
[2] https://bbs.archlinux.org/viewtopic.php?pid=1715404
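
For reference, a rough sketch of how backtraces like the attached one can be captured on a systemd-based system (assumes root, and that sysrq and the kernel's hung-task detector are enabled; exact log wording varies by kernel version):

```sh
journalctl -k -b | grep -i f2fs             # kernel messages mentioning f2fs this boot
echo w > /proc/sysrq-trigger                # ask the kernel to dump blocked (D-state) tasks
dmesg | grep -A 20 'blocked for more than'  # hung-task backtraces, if the detector is enabled
```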
Comment 1 Yill Din 2017-06-15 06:51:00 UTC
For me it's 100% reproducible when running `npm install` with the following `package.json`:

```
{
  "name": "foo",
  "version": "0.0.1",
  "description": "",
  "author": "bar",
  "license": "MIT",
  "dependencies": {
    "bulma": "^0.4.1",
    "js-cookie": "^2.1.3",
    "lodash": "^4.17.4",
    "vue": "^2.2.1",
    "vue-i18n": "^6.0.0-alpha.6",
    "vue-multiselect": "^2.0.0-beta.14",
    "vue-nprogress": "^0.1.5",
    "vue-resource": "^1.2.1",
    "vue-router": "^2.3.0",
    "vue-shortkey": "^2.1.0"
  },
  "devDependencies": {
    "clean-webpack-plugin": "^0.1.16",
    "css-loader": "^0.26.1",
    "extract-text-webpack-plugin": "^2.0.0-rc.3",
    "node-sass": "^4.5.0",
    "sass-loader": "^6.0.2",
    "vue-loader": "^11.1.3",
    "vue-template-compiler": "^2.2.1",
    "webpack": "^2.2.1",
    "webpack-bundle-tracker": "^0.2.0",
    "webpack-merge": "^4.1.0"
  },
  "jshintConfig": {
    "esversion": 6,
    "strict": "global",
    "asi": true,
    "browser": true,
    "browserify": true,
    "jquery": false
  }
}
```

The npm process gets stuck forever.
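
For what it's worth, the reproduction above can be scripted. The sketch below is a hypothetical minimal variant: the directory name and the trimmed dependency list are illustrative choices (node-sass is kept because its native build produces many small writes); the full package.json above reproduces most reliably.

```shell
# Hypothetical minimal npm reproducer; run it on the f2fs mount.
mkdir -p /tmp/f2fs-npm-repro
cd /tmp/f2fs-npm-repro
cat > package.json <<'EOF'
{
  "name": "foo",
  "version": "0.0.1",
  "dependencies": {
    "lodash": "^4.17.4"
  },
  "devDependencies": {
    "node-sass": "^4.5.0"
  }
}
EOF
# On an affected kernel, `npm install` is the step that hangs in D state
# (f2fs_issue_flush appears in the blocked task's stack trace):
echo "now run: npm install"
```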
Comment 2 NotTheEvilOne 2017-06-17 16:04:57 UTC
Dear all,

Same issue occurs with ceph (mon directory on f2fs) for the root partition.

Best regards
Tobias
Comment 3 mwohah 2017-07-29 18:27:56 UTC
Arch just shipped 4.12.3. I didn't experience issues at first, but then tried to trigger the problem artificially by starting and stopping docker a few times and, lo and behold, it resurfaced.

I've switched back to 4.9 LTS again, as it's been running for two months without problems now.

My problems seem to be mostly limited to annoying application freezes and zombie processes, but a friend's filesystem became completely corrupted after some time, with the logs showing the issue multiple times (he didn't know he was affected before).
Comment 4 Szczepan 2017-08-30 18:39:11 UTC
Hello f2fs devs? How can we help you tackle this issue?
Comment 5 Chao Yu 2017-09-23 00:54:04 UTC
Sorry for the late reply. :(

I suspect this is a bug in the flush_merge feature. In the just-released 4.14-rc1 kernel we fixed some potential issues in this feature that could sometimes cause userspace applications to hang, so I'd suggest trying the latest f2fs code in that kernel to see whether the issue is fixed.

Also, could you track this issue in this thread on the f2fs mailing list:
https://sourceforge.net/p/linux-f2fs/mailman/message/36037901/
Comment 6 mwohah 2017-09-29 14:44:55 UTC
Thanks for your reply!

I will try switching to the currently stable Linux 4.13 in Arch, activate the noflush_merge option, and see if the issue still appears.

I've been trying to make the issue resurface by going wild with docker a bit, like before, but this didn't trigger it anymore. I've been using the noflush_merge option for about a day now and, so far, the issue has not occurred again.

This seems to indicate that the issue is indeed in the flush merging functionality.

I'll try to stay on 4.13 with noflush_merge for a while and see if anything bad happens. If not, at least I have a way to use the more recent kernels with F2FS :-).
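
For anyone wanting the same workaround, the mount option can be made persistent via /etc/fstab; the entry below is only illustrative, and the UUID and mount point are placeholders:

```
# /etc/fstab — example entry; substitute your own device and mount point
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /home  f2fs  defaults,noflush_merge  0  2
```

It should also be possible to apply it to a live mount with `mount -o remount,noflush_merge <mountpoint>`; `grep f2fs /proc/mounts` shows whether the option took effect.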
Comment 7 me 2017-09-29 17:13:08 UTC
The issue occurs rarely. I believe we need **more testing** before declaring it fixed. @mwohah, should we publish testing instructions on the Arch Linux forum in order to get more testers? However, I am not sure what exactly I would need to do to test the fix.
Comment 8 mwohah 2017-09-30 16:57:03 UTC
@me I agree it needs more testing, though some use cases seem to expose the issue fairly often, such as npm and docker, possibly due to the amount of I/O involved.

We could post it on the Arch forums, though I already made a topic there (see the OP) as well as on the bug tracker, both of which link to this ticket, since it is a kernel issue.

However, I do agree it could use a bit more visibility (perhaps on the Arch wiki page on f2fs?), so that users who are installing or want to install f2fs know that this issue exists, since it can cause instability; and, provided that noflush_merge fixes it, the workaround is rather trivial.
Comment 9 mwohah 2017-12-07 19:32:35 UTC
@Chao Yu I've updated to 4.14 some days ago and switched noflush_merge back to flush_merge two days ago. So far, I haven't encountered any problems. Here's to hoping the problem is solved!

Should the problem return, I'll post something here. I also wanted to point others to the fixes in 4.14, as their experience on it may differ.

Thanks!
Comment 10 Chao Yu 2017-12-12 13:47:24 UTC
@mwohah, Thanks for your test and feedback. :)
Comment 11 Szczepan 2020-03-26 07:03:31 UTC
Hey, sorry to bother two years later, but I am considering switching to F2FS again and I wonder if lack of further activity in this bug is because the problem got fixed or because everyone affected migrated away from F2FS (I certainly did).
Comment 12 me 2020-03-26 09:06:02 UTC
(In reply to Szczepan from comment #11)
> Hey, sorry to bother two years later, but I am considering switching to F2FS
> again and I wonder if lack of further activity in this bug is because the
> problem got fixed or because everyone affected migrated away from F2FS (I
> certainly did).

I am still using F2FS.
Comment 13 Chao Yu 2020-03-27 08:00:40 UTC
(In reply to Szczepan from comment #11)
> Hey, sorry to bother two years later, but I am considering switching to F2FS
> again and I wonder if lack of further activity in this bug is because the
> problem got fixed or because everyone affected migrated away from F2FS (I
> certainly did).

I think it's worth giving f2fs another try, as we have added lots of features and made the code more stable over the last two years.

Also, I know there are users running f2fs from kernel v5.6-rc1 as their root partition filesystem; apart from one task hang issue that we have fixed in our git tree, I haven't received any further bug reports.
Comment 14 Chao Yu 2020-03-27 08:02:23 UTC
(In reply to me from comment #12)
> (In reply to Szczepan from comment #11)
> > Hey, sorry to bother two years later, but I am considering switching to
> F2FS
> > again and I wonder if lack of further activity in this bug is because the
> > problem got fixed or because everyone affected migrated away from F2FS (I
> > certainly did).
> 
> I am still using F2FS.

Cool, thanks for the trust.
