Bug 25792 - Kernel panic in __mark_inode_dirty (fs-writeback.c: 978)
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: All Linux
Importance: P1 high
Assignee: fs_ext4@kernel-bugs.osdl.org
 
Reported: 2010-12-29 00:10 UTC by Karthick A R
Modified: 2012-08-14 15:14 UTC

Kernel Version: 2.6.36-rc8
Regression: No

Attachments
Crash debug logs related to the kernel panic (358.86 KB, text/plain)
2010-12-29 00:10 UTC, Karthick A R

Description Karthick A R 2010-12-29 00:10:24 UTC
Created attachment 41822
Crash debug logs related to the kernel panic

The panic backtrace and all associated crash debug logs are attached, along with dumps of the inode, the superblock structure and the address_space mapping at the time of the panic, taken from a crash session on the vmcore.

It was first reproduced by my daughter. The trick, I am told, is to pull the USB cable while daddy is at work :-)

It is consistently reproducible when I run our product stack with our applications sandboxed to a USB drive (bind-mounted) and pull the USB drive while they are running. The panic is a result of the inode's bdi (backing device info) pointer going NULL while the inode state is I_DIRTY. The superblock's bdi pointer for the "ext4" superblock type is NULL. The panic in the attachment shows that the kernel was trying to resolve a write-protected page fault on a mmapped page with the USB drive as the backing storage. There appears to be a race between sd_remove, which results in the ext4 superblock's bdi being invalidated, and a parallel write to the mmapped page in the backing store. I am wondering how the ext4 superblock's bdi was invalidated/set to NULL while the inode was being dirtied.
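
For illustration only, here is a minimal sketch of the kind of userspace workload described above: repeated stores to a shared file mapping, with a flush between rounds so that the pages are written back and the next store has to dirty them again. It is not the code from the gist and it is not confirmed to reproduce the panic. The function name dirty_mapped_pages, the file name and the page/round counts are hypothetical; to exercise the scenario in this report the path would have to point at a file on the USB-backed ext4 mount, and the drive would have to be pulled while the loop runs.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def dirty_mapped_pages(path, pages=4, rounds=3):
    """Repeatedly dirty every page of a shared file mapping.

    Hypothetical helper, not the reporter's test code. Returns the
    size of the mapping in bytes.
    """
    size = pages * PAGE
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, size)  # the file must be at least as large as the mapping
        with mmap.mmap(fd, size, mmap.MAP_SHARED,
                       mmap.PROT_READ | mmap.PROT_WRITE) as mm:
            for r in range(rounds):
                for p in range(pages):
                    # One store per page is enough to dirty that page.
                    mm[p * PAGE] = (r + 1) & 0xFF
                # msync(): push the dirty pages back to the file so the
                # next round has to dirty them again.
                mm.flush()
    finally:
        os.close(fd)
    return size


if __name__ == "__main__":
    # Assumption: a temp file stands in for a file on the USB mount.
    with tempfile.TemporaryDirectory() as d:
        print(dirty_mapped_pages(os.path.join(d, "mapped.dat")))
```

Run against a local temporary file, this only demonstrates the access pattern; the device removal has to be done by hand.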

An effort to reproduce the problem outside our product stack, where it is consistently reproducible, has not been successful yet: https://gist.github.com/757928
I did, however, hit uninterruptible task hang warnings when I ran the above test code and removed the USB drive while the test was issuing writes.
Some of the child processes that write to the disk remained in "D" (uninterruptible) state forever after the USB device was forcefully ejected.
They appear to be stuck in an ext4 journalled write. The backtraces for all the uninterruptible tasks are also part of the crash debug attachment.
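
As a rough aid for spotting such hung writers, the sketch below lists processes whose state field in /proc/<pid>/stat is "D". It is an assumption-laden helper, not part of the attached logs or the gist: the names parse_stat and uninterruptible_tasks are hypothetical, and it only looks at per-process stat files, not at individual threads.

```python
import os


def parse_stat(stat_text):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the last ')' rather than on whitespace.
    """
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Return [(pid, comm), ...] for processes currently in state 'D'."""
    tasks = []
    if not os.path.isdir(proc_root):
        return tasks
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited or stat was unreadable mid-scan
        if state == "D":
            tasks.append((pid, comm))
    return tasks


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

A process that shows up here on every run after the drive has been removed is a candidate for the hang described above; a single sighting is not, since tasks pass through "D" state briefly during normal I/O.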

I am running a slightly older 2.6.36-rc8, but I believe this issue is not fixed in 2.6.36 either, since I don't see any fixes in ext4 or fs-writeback related to the above panic.

Since I am always able to reproduce the panic with our product stack sandboxed to the USB device, I can easily verify any patches related to this issue.
Regarding the ext4 uninterruptible task lockups/hangs, they are easily reproducible with the test code in the gist mentioned above: https://gist.github.com/757928

I believe this is a major issue considering the backtrace, crash debug logs and the probable race symptoms with sd_remove and ext4 writeback mentioned above.
