Bug 15025
| Summary: | Oops in ext4 driver | | |
|---|---|---|---|
| Product: | File System | Reporter: | Steinar H. Gunderson (steinar+kernel) |
| Component: | ext4 | Assignee: | fs_ext4 (fs_ext4) |
| Status: | CLOSED UNREPRODUCIBLE | | |
| Severity: | high | CC: | dmonakhov, rjw, tytso |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.33-rc3 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Bug Depends on: | | | |
| Bug Blocks: | 14885 | | |
Description
Steinar H. Gunderson
2010-01-10 13:09:58 UTC
Since you marked it as a regression, what was the last working kernel?

On Wed, Jan 13, 2010 at 10:09:32PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #1 from Rafael J. Wysocki <rjw@sisk.pl> 2010-01-13 22:09:30 ---
> Since you marked it as a regression, what was the last working kernel?

I upgraded from 2.6.32-rc5, where I didn't see this issue. 2.6.32.3 works fine on the same machine and filesystem.

/* Steinar */

On Sunday 24 January 2010, Steinar H. Gunderson wrote:
> On Sun, Jan 24, 2010 at 11:04:35PM +0100, Rafael J. Wysocki wrote:
> > The following bug entry is on the current list of known regressions
> > from 2.6.32. Please verify if it still should be listed and let me know
> > (either way).
>
> I'm not using 2.6.33 anymore since this bug is a showstopper to me (it's on a
> production system), so I'm unable to check if it's fixed or not.

Hi Steinar,

Sorry for not getting back to you right away; I've been doing a huge amount of travel during January. Can you tell me something about the file system workload on your machine? What does it do? NFS, rsync server, backups, ...? And do you know what it might be doing right before it crashed? How easily can you reproduce this? I take it since you had to stop using 2.6.33-rcX you could reproduce it easily?

If you are willing to try a 2.6.33-rcX kernel, I'd suggest seeing if "echo 0 > /sys/fs/ext4/<dev>/max_writeback_mb_bump" makes the crashes go away.

On Wed, Jan 27, 2010 at 07:35:11PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> Sorry for not getting back to you right away; I've been doing a huge amount of
> travel during January. Can you tell me something about the file system
> workload on your machine? What does it do? NFS, rsync server, backups, ...?

IIRC this was a file system that was mainly used for video storage and transcoding -- I think I was encoding a video with x264 to it when it crashed. Apart from that the machine spends most of its I/O time doing web serving from relatively large (1-2TB) data sets, and occasionally rtorrent. It was recently online expanded, so I thought that might be related, but the problem persisted after a reboot and a forced fsck, so there was no on-disk corruption involved.

> And do you know what it might be doing right before it crashed? How easily
> can you reproduce this? I take it since you had to stop using 2.6.33-rcX you
> could reproduce it easily?

It crashed two times in two days or something after I upgraded to 2.6.33-rcX. Not a statistically huge sample, I'm afraid.

> If you are willing to try a 2.6.33-rcX kernel, I'd suggest seeing if "echo 0 >
> /sys/fs/ext4/<dev>/max_writeback_mb_bump" makes the crashes go away.

I'm afraid it's not so easy for me to do reboots into new kernels on this machine; kernel upgrades generally happen when the machine is booted for some other reason.
:-/

/* Steinar */

There was a power drop (too long for the UPS), so I've now run 2.6.33-rc8 on this same machine for about 24 hours without seeing any ext4 errors. The load is probably different, though, but at least it doesn't seem to bite me anymore.

The issue was fixed by the following commit:
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commit;h=1db913823c0f8360fccbd24ca67eb073966a5ffd

Test case:

    dmon$ sudo mount /dev/sdd /mnt -oquota
    dmon$ set-quota-limit /mnt id=dmon --bsoft=1000 --bsoft=1000
    dmon$ dd if=/dev/zero of=/mnt/file

Please close the bug.

Thanks, closing.

Dmitry: That seems impossible, as I'm not using quota on the machine in question (it's not even compiled into the kernel).

On Tue, Feb 16, 2010 at 02:08:33PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> The issue was fixed by following commit
>
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commit;h=1db913823c0f8360fccbd24ca67eb073966a5ffd

This cannot be, as I don't use quota.

/* Steinar */

After the patch I cannot trigger the bug.

(In reply to comment #10)
> On Tue, Feb 16, 2010 at 02:08:33PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > The issue was fixed by following commit
> >
> > http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commit;h=1db913823c0f8360fccbd24ca67eb073966a5ffd
>
> This cannot be, as I don't use quota.

It is possible to trigger the bug without quota. Until the patch we had the following code in fs/ext4/inode.c:

    if (ext4_claim_free_blocks(sbi, md_needed + 1)) {
            vfs_dq_release_reservation_block(inode, md_needed + 1);
            if (ext4_should_retry_alloc(inode->i_sb, &retries)) {
    retry:
                    if (md_reserved)
                            write_inode_now(inode, (retries == 3));
                            /* ^ Here we run out of journal credits. */
                    yield();
                    goto repeat;
            }
            return -ENOSPC;
    }

You have failed exactly here.
So the bug happens even in the plain ENOSPC case (try the following test case):

    dd if=/dev/zero of=/mnt/BIG_FILE bs=1M

But it takes longer if the partition is really huge. Calling write_inode_now() from ext4_da_get_block_prep() was the core of the issue, and the patch moves that call to an upper level, so the issue is completely fixed. Please close the bug as CODE_FIXED.
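
The restructuring described in the last comment can be illustrated with a small user-space model. This is only a sketch, not kernel code and not the contents of the patch: every name in it (`toy_fs`, `toy_flush`, `toy_reserve`, `toy_write_begin`) is hypothetical, and it assumes the point of the fix is that the inner reservation step merely reports -ENOSPC, while the caller flushes and retries once no journal handle is open.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical toy model of a filesystem; not the ext4 data structures. */
struct toy_fs {
    long free_blocks;     /* blocks available for new reservations */
    long reclaimable;     /* blocks a flush would make available */
    bool handle_held;     /* true while a "journal handle" is open */
    int  nested_flushes;  /* flushes attempted while a handle was open */
};

/*
 * Stand-in for write_inode_now(): a flush needs journal credits of its
 * own, so a flush under an already open handle is the problematic case.
 * The model only counts such calls instead of crashing.
 */
void toy_flush(struct toy_fs *fs)
{
    if (fs->handle_held)
        fs->nested_flushes++;
    fs->free_blocks += fs->reclaimable;
    fs->reclaimable = 0;
}

/* Inner step: try to claim blocks and report -ENOSPC; never flush here. */
int toy_reserve(struct toy_fs *fs, long nblocks)
{
    if (fs->free_blocks < nblocks)
        return -ENOSPC;
    fs->free_blocks -= nblocks;
    return 0;
}

/*
 * Upper level: open the handle, reserve, close the handle. On -ENOSPC,
 * flush and retry, but only after the handle has been closed.
 */
int toy_write_begin(struct toy_fs *fs, long nblocks)
{
    int retries = 0;
    int err;

    for (;;) {
        fs->handle_held = true;
        err = toy_reserve(fs, nblocks);
        fs->handle_held = false;

        if (err != -ENOSPC || retries++ >= 3)
            return err;
        toy_flush(fs);  /* no handle is open at this point */
    }
}
```

In this model a request that fits only after a flush succeeds on the retry, a request that can never fit still ends in -ENOSPC after a bounded number of retries, and `nested_flushes` stays at zero because the flush never runs under an open handle.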