Bug 66821 - "BUG: soft lockup" in proc_fd_link, causing freeze
Status: RESOLVED OBSOLETE
Alias: None
Product: File System
Classification: Unclassified
Component: Other
Hardware: x86-64 Linux
Importance: P1 high
Assignee: fs_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-12-10 17:01 UTC by Andreas Reis
Modified: 2014-02-18 14:29 UTC

See Also:
Kernel Version: 3.13.0-rc3
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg output during progressing freeze (150.73 KB, application/octet-stream)
2013-12-10 17:01 UTC, Andreas Reis
dmesg output during progressing freeze, 2 (18.11 KB, text/plain)
2013-12-10 20:02 UTC, Andreas Reis
dmesg output of crash (6.90 KB, text/plain)
2013-12-12 16:20 UTC, Andreas Reis

Description Andreas Reis 2013-12-10 17:01:34 UTC
Created attachment 118011
dmesg output during progressing freeze

I have been getting this randomly for a few days, either as an immediate freeze or as a progressing system freeze. I'm not sure whether the two are the same bug, though (nor whether I filed this report under the right product/component category; I was just guessing from the trace).

It starts with
general protection fault: 0000 [#1] PREEMPT SMP
path_get+0x15/0x30
proc_fd_link+0x9e/0xf0
proc_pid_readlink+0x3f/0xe0
SyS_readlinkat+0xeb/0x130
system_call_fastpath+0x1a/0x1f

and becomes
BUG: soft lockup - CPU#5 stuck for 23s! [chrome-sandbox:11068]
proc_fd_link+0x69/0xf0
proc_pid_readlink+0x3f/0xe0
SyS_readlinkat+0xeb/0x130
system_call_fastpath+0x1a/0x1f
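
For readers unfamiliar with the notation: each frame has the form `symbol+0xOFFSET/0xSIZE`, meaning the return address lies OFFSET bytes into a function that is SIZE bytes long. A trailing `[name]` gives the module the symbol belongs to, and a leading `?` marks a frame the unwinder is not sure about. A minimal Python sketch for splitting such lines apart when comparing traces from different freezes (the helper name `parse_frame` and the returned field names are illustrative, not part of any kernel tooling):

```python
import re

# Matches frames such as "proc_fd_link+0x9e/0xf0", optionally prefixed by "? "
# and optionally followed by a module name such as " [ext4]".
FRAME_RE = re.compile(
    r"^(?P<unsure>\?\s+)?"
    r"(?P<symbol>[\w.]+)"
    r"\+0x(?P<offset>[0-9a-f]+)"
    r"/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?$"
)


def parse_frame(line):
    """Return a dict describing one trace frame, or None if the line is not a frame."""
    m = FRAME_RE.match(line.strip())
    if m is None:
        return None
    return {
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),  # bytes into the function
        "size": int(m.group("size"), 16),      # total length of the function
        "module": m.group("module"),           # None for built-in symbols
        "reliable": m.group("unsure") is None, # False for "? ..." frames
    }


if __name__ == "__main__":
    for text in ("path_get+0x15/0x30", "proc_fd_link+0x9e/0xf0"):
        print(parse_frame(text))
```

Comparing the `offset` values shows that the fault and the later soft lockup sit at different points inside `proc_fd_link` (0x9e versus 0x69).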

Notes: 
* This is based on only one dmesg output; I haven't yet gotten around to comparing the output of different freezes.
* Kernel is self-compiled from the current sources (17b2112f) in Torvalds' git; distro is Arch.
* Always appears to happen after the system has run for a while, i.e. not right after boot.
* root is btrfs, home is ext4, var is reiserfs.
Comment 1 Andreas Reis 2013-12-10 20:02:15 UTC
Created attachment 118021
dmesg output during progressing freeze, 2

Another freeze. This time it's:

proc_cwd_link+0x5d/0xd0
proc_pid_readlink+0x3f/0xe0
SyS_readlinkat+0xeb/0x130
SyS_readlink+0x1b/0x20
system_call_fastpath+0x1a/0x1f
Comment 2 Theodore Tso 2013-12-10 21:20:40 UTC
proc_cwd_link and proc_fd_link are both procfs routines. So do you have something which is constantly scanning files in /proc?

In any case, it's highly unlikely this is an ext4 issue.
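
For context, the traces above all enter the kernel through readlink on a /proc symlink. A minimal Python sketch of what a /proc scanner does from user space, assuming Linux with procfs mounted at /proc (the helper name `fd_targets` is made up for illustration); going by the traces in this report, each `os.readlink` call on a `/proc/<pid>/fd/N` entry would be served by proc_pid_readlink and proc_fd_link:

```python
import os


def fd_targets(pid="self"):
    """Map each open file descriptor of a process to the path it points at."""
    fd_dir = f"/proc/{pid}/fd"
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            # One readlink system call per descriptor.
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed in the meantime, or access was denied.
            continue
    return targets


if __name__ == "__main__":
    for fd, target in sorted(fd_targets().items()):
        print(fd, "->", target)
```

Tools such as lsof, process monitors, and sandboxing helpers do something similar for many processes at once, which is why a scanner of this kind is a plausible trigger rather than the underlying cause.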
Comment 3 Andreas Reis 2013-12-10 21:43:52 UTC
Nothing that I'm aware of, and I don't use anything other than ordinary desktop software. The bug only appears to occur under particular circumstances anyway, since the system will run just fine for an average of perhaps 45 min after booting. (I'm back to kernel 3.12.4 for now.)

Sorry about filing it as an ext4 issue; that was a naive and hasty guess.
Comment 4 Andreas Reis 2013-12-12 16:20:05 UTC
Created attachment 118171
dmesg output of crash

Again a progressing freeze (3.13-git at 9538e100), this time "caused" by copying a file from /home ext4 to a truecrypt ext4 partition and taking the form of a general protection fault.

I could still call the restart command, and it displayed yet another (not captured) trace before the system came to a halt.

I also let the Windows [8.1] Memory Diagnostic run, which reported no errors.

__mem_cgroup_uncharge_common+0xc3/0x380
mem_cgroup_uncharge_cache_page+0x12/0x20
delete_from_page_cache+0x48/0x70
truncate_inode_page+0x5b/0x90
truncate_inode_pages_range+0x169/0x5e0
truncate_inode_pages+0x15/0x20
ext4_evict_inode+0x128/0x530 [ext4]
evict+0xb0/0x1b0
iput+0xf5/0x190
do_unlinkat+0x18e/0x2c0
? SyS_futimesat+0x8b/0xc0
SyS_unlink+0x16/0x20
system_call_fastpath+0x1a/0x1f
Comment 5 Theodore Tso 2013-12-12 19:24:10 UTC
Apparently you are using auditing and memory cgroups as well, which are also potential suspects. It wouldn't surprise me if some wild pointer has corrupted some memory, and so the bug might not be in the stack traces which you are seeing.

Since you are using lots of fun and interesting kernel subsystems, would it be possible to try eliminating or avoiding some of them (e.g. audit, truecrypt, memcg, btrfs, reiserfs, etc.) and see if the problem goes away?

You might also want to think about using a debugging kernel that enables various debugging options -- e.g., CONFIG_DEBUG_PAGEALLOC, CONFIG_PAGE_GUARD, CONFIG_DEBUG_STACKOVERFLOW, etc.
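
One way to check whether a given kernel configuration already has these options turned on is to scan its config text. A minimal Python sketch, assuming the configuration is available as plain text (for example the `.config` file in the build tree); the helper name `missing_options` is hypothetical, and the option list is just the three names mentioned above:

```python
# Options named in the comment above; extend as needed.
WANTED = [
    "CONFIG_DEBUG_PAGEALLOC",
    "CONFIG_PAGE_GUARD",
    "CONFIG_DEBUG_STACKOVERFLOW",
]


def missing_options(config_text, wanted=WANTED):
    """Return the options from `wanted` that are not set to y or m."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        if value in ("y", "m"):
            enabled.add(name)
    return [opt for opt in wanted if opt not in enabled]


if __name__ == "__main__":
    with open(".config") as fh:  # path is an assumption; adjust to your build tree
        print(missing_options(fh.read()))
```

On kernels built with CONFIG_IKCONFIG_PROC, the running kernel's configuration can also be read from /proc/config.gz (for instance via `gzip.open`) instead of the build tree.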
Comment 6 Andreas Reis 2014-02-18 14:29:43 UTC
This doesn't happen anymore, so I assume the issues have been fixed.

Sorry about not replying, I was busy and then it slipped my mind.
