Bug 56821 - an ext4 commit ee0906f causes weird disk hangs
Summary: an ext4 commit ee0906f causes weird disk hangs
Status: RESOLVED OBSOLETE
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-04-19 12:25 UTC by Tommi Kyntola
Modified: 2018-04-29 16:11 UTC

See Also:
Kernel Version: 3.8.5
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
A console msg often seen during the hang (628 bytes, text/plain)
2013-04-19 12:25 UTC, Tommi Kyntola
the config used during the bisection (101.04 KB, text/plain)
2013-04-19 12:26 UTC, Tommi Kyntola

Description Tommi Kyntola 2013-04-19 12:25:53 UTC
Created attachment 99301
A console msg often seen during the hang

The commit (ee0906fc8da3447d168a73570754a160ecbe399b ext4: use s_extent_max_zeroout_kb value as number of kb) causes a strange disk/raid/fs hang for me.

Steps to reproduce:
1) login 
2) startx (I've tried with nv and nvidia)
3) launch thunderbird and wait 3..10 secs

Expected results:
- just another day in the office

Actual results:
- A hang. First I see some refreshes not happening, and shortly afterwards I can't do anything besides switch from X to the consoles and back. I can type on the terminals that are still live, but any disk access hangs them, too. The message in the attached console_msg.txt sometimes appears if I wait long enough. What I do next is Magic SysRq: sync, remount read-only, reboot.

I've used practically every stable release on this box since some time before 3.0 without problems. Ever since 3.8.5 I've been stuck on 3.8.4. Since then I've tried every stable release up to 3.8.8 and none of them work.

The ee0906f commit seems to cause it. I double-checked the surrounding commits, but did not go further than that. It takes 10 minutes to resync my raid-1 after a failure, and that rather limits my enthusiasm for working on it further on my own. No damage seems to be caused by such an event, though. The raid sync succeeds every time; it only takes a while.

The setup is an updated Fedora 18 on an AMD 4184, 16 GB RAM, LSI SAS controller with two 300 GB disks. Each disk has three partitions: the first on both forms a 50 GB raid1 ext4 used as root, and the second on both forms a 100 GB raid1 ext4 used as /home. The third partitions are old non-raid ext3 or ext4 filesystems that aren't mounted or used.

I haven't managed to cause the hang outside of X. I've tried some kernel compiling and catting files to /dev/null, but no luck. In X, with nv or nvidia alike, thunderbird seems to trigger it. It launches fully, but within a few to ten seconds things start to fail. Another interesting tidbit is that the disk LEDs in the array both get turned off, which is anomalous. Usually they only blink during access.

I'm willing to provide information and try out things, just let me know what you need.
Comment 1 Tommi Kyntola 2013-04-19 12:26:22 UTC
Created attachment 99311
the config used during the bisection
Comment 2 Theodore Tso 2013-04-19 17:32:54 UTC
This should allow your system not to crash.

echo 0 > /sys/fs/ext4/<dev>/extent_max_zeroout_kb

The failure which you are showing seems to be one where your SCSI controller and/or your SCSI disks are freaking out when ext4 tries to zero out a block range by calling sb_issue_zeroout().   The block layer will translate this into a TRIM command or a SCSI WRITE SAME command for those devices which support this, so that blocks can be efficiently zeroed out.  

It looks like the block device layer translated this to a standard SCSI WRITE(10) command which is getting issued to both disks at the same time (I assume you are using a software raid via an md device?).   I suspect this is a case where ext4 is enabling a new block device optimization interface, and this is interacting badly with your hardware or your block device driver.

So we need to figure out what is actually causing the failure, so we can somehow automatically blacklist whatever is failing. In the meantime, you can force off the optimization at the ext4 layer by setting extent_max_zeroout_kb to zero. Hopefully we can figure out a better way of disabling the optimization at a lower level (so you can get the benefits of minimizing extent tree fragmentation without causing your raid array to hang), and some way of automatically disabling some level of optimization or applying a hardware breakage workaround.
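
For a machine with more than one ext4 filesystem (here, root and /home), the workaround above can be applied to each of them in one pass. The following is only a minimal sketch: the function name and the SYSFS_EXT4 override are hypothetical, added so the loop can be exercised against a scratch directory, and sysfs tunables like this one are not normally persistent, so it would presumably need to be rerun after each mount.

```shell
#!/bin/sh
# Sketch: write 0 to extent_max_zeroout_kb for every ext4 filesystem
# that exposes the knob under /sys/fs/ext4/<dev>/.
# SYSFS_EXT4 is a hypothetical override for testing; it is not a kernel
# or e2fsprogs setting. By default the real sysfs path is used.
disable_zeroout() {
    root="${SYSFS_EXT4:-/sys/fs/ext4}"
    for knob in "$root"/*/extent_max_zeroout_kb; do
        # Skip the unexpanded glob when nothing matches.
        [ -e "$knob" ] || continue
        old=$(cat "$knob")
        echo 0 > "$knob"
        # Report the previous and the new value for each filesystem.
        echo "$knob: $old -> $(cat "$knob")"
    done
}

# Usage (as root, on the affected machine):
#   disable_zeroout
```

Printing the old value makes it easy to restore the previous setting by hand once the underlying problem is understood.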


mptscsih: ioc0: attempting task abort! (sc=ffff8803ec450f00)
sd 6:0:1:0: [sdb] CDB:
Write(10): 2a 00 12 60 a0 a8 00 00 40 00
mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff8803ec450f00)
mptscsih: ioc0: attempting task abort! (sc=ffff8803ec450900)
sd 6:0:0:0: [sda] CDB:
Write(10): 2a 00 12 60 a0 a8 00 00 40 00
mptbase: ioc0: LogInfo(0x31140000): Originator={PL}, Code={IO Executed}, SubCode(0x0000) cb_idx mptscsih_io_done
mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff8803ec450900)
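
The aborted command in the log above can be decoded by hand. This is a small sketch, not part of the original report: it assumes the standard 10-byte WRITE(10) layout (opcode 0x2a in byte 0, a big-endian logical block address in bytes 2 to 5, a big-endian transfer length in bytes 7 to 8) and a 512-byte logical block size, which the log does not state. The function name and the BLOCK_SIZE override are hypothetical.

```shell
#!/bin/sh
# Sketch: decode a SCSI WRITE(10) CDB given as ten hex bytes.
# BLOCK_SIZE is an assumed logical block size in bytes (default 512);
# the log itself does not say what the disks use.
decode_write10() {
    [ "$#" -eq 10 ] || { echo "expected 10 hex bytes" >&2; return 1; }
    # Byte 0 must be the WRITE(10) opcode, 0x2a.
    [ "$((0x$1))" -eq 42 ] || { echo "not a WRITE(10) opcode: $1" >&2; return 1; }
    # Bytes 2-5: logical block address, big-endian.
    lba=$(( (0x$3 << 24) | (0x$4 << 16) | (0x$5 << 8) | 0x$6 ))
    # Bytes 7-8: transfer length in blocks, big-endian.
    blocks=$(( (0x$8 << 8) | 0x$9 ))
    bs="${BLOCK_SIZE:-512}"
    echo "lba=$lba blocks=$blocks kib=$(( blocks * bs / 1024 ))"
}

# Usage with the CDB from the log:
#   decode_write10 2a 00 12 60 a0 a8 00 00 40 00
#   prints: lba=308322472 blocks=64 kib=32
```

Under those assumptions, both disks were sent the same 64-block (32 KiB) write at the same address, as would be expected from a raid1 mirror.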
Comment 3 Tommi Kyntola 2013-04-20 17:59:25 UTC
Yes to software raid, I thought I mentioned it there already, but evidently I skipped that bit, sorry.

I can strace the thunderbird launch, which is still the only way I've managed to trigger it. Thankfully it triggers it every time. That will have to wait until Monday, though, as it's my office box.

Would you happen to have ideas on what I could try in order to trigger it in a cleaner way? (i.e. should I be generating load with large files or lots of small ones, lots of concurrent activity or will a single thread do, etc.)

And what else should I be trying? Stripping down the kernel config? Would ext4 debugs help?
Comment 4 Tommi Kyntola 2018-04-29 16:11:19 UTC
This has not been a problem for quite some time, and I no longer even have the hardware to reproduce it, so I'm closing this.
