Bug 8414 - soft lockup and filesystem corruption on XFS write (by nfsd)
Summary: soft lockup and filesystem corruption on XFS write (by nfsd)
Status: REJECTED INSUFFICIENT_DATA
Alias: None
Product: File System
Classification: Unclassified
Component: XFS
Hardware: i386 Linux
Importance: P2 high
Assignee: XFS Guru
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-05-01 04:21 UTC by Pallai Roland
Modified: 2008-02-02 02:02 UTC
CC List: 3 users

See Also:
Kernel Version: 2.6.21.1,2.6.20.3
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
my .config (49.22 KB, text/plain)
2007-05-01 04:23 UTC, Pallai Roland
superblock (842 bytes, text/plain)
2007-05-02 09:21 UTC, Pallai Roland
superblock by xfs_db (732 bytes, text/plain)
2007-05-02 09:22 UTC, Pallai Roland

Description Pallai Roland 2007-05-01 04:21:48 UTC
Most recent kernel where this bug did *NOT* occur: 2.6.20.3
Distribution: Debian Etch
Hardware Environment: Pentium D SMP, 4G RAM, x86_64
Software Environment: NFSv3 serving via UDP from XFS on LVM that's on software 
RAID5 (built from 16 disks, 128k chunk)
Problem Description:
I upgraded my kernel from 2.6.20.3 to 2.6.21.1 on my NFS server, and one day 
later the box stopped serving NFS with the following message:

BUG: soft lockup detected on CPU#1!

Call Trace:
 <IRQ>  [<ffffffff802acdea>] softlockup_tick+0xfa/0x120
 [<ffffffff802122af>] __do_softirq+0x5f/0xd0
 [<ffffffff80295877>] update_process_times+0x57/0x90
 [<ffffffff8027a314>] smp_local_timer_interrupt+0x34/0x60
 [<ffffffff8027a8ab>] smp_apic_timer_interrupt+0x4b/0x80
 [<ffffffff80263696>] apic_timer_interrupt+0x66/0x70
 <EOI>  [<ffffffff881466b0>] :xfs:xfs_iext_idx_to_irec+0x100/0x110
 [<ffffffff88146703>] :xfs:xfs_iext_get_ext+0x43/0x70
 [<ffffffff88146d68>] :xfs:xfs_iext_bno_to_ext+0x138/0x160
 [<ffffffff881248f5>] :xfs:xfs_bmap_search_multi_extents+0x75/0x120
 [<ffffffff88124a13>] :xfs:xfs_bmap_search_extents+0x73/0x120
 [<ffffffff8812a258>] :xfs:xfs_bmapi+0x2b8/0x1150
 [<ffffffff8023c966>] unmap_underlying_metadata+0x6/0x40
 [<ffffffff80217b3b>] __block_commit_write+0x7b/0xc0
 [<ffffffff88170c22>] :xfs:xfs_zero_eof+0xd2/0x190
 [<ffffffff88171a96>] :xfs:xfs_write+0x576/0xb70
 [<ffffffff8026aac9>] _write_lock_bh+0x9/0x20
 [<ffffffff881fbacd>] :nf_conntrack:__nf_ct_refresh_acct+0x3d/0x120
 [<ffffffff8814643d>] :xfs:xfs_iget+0x12d/0x170
 [<ffffffff802236d1>] __up_read+0x21/0xb0
 [<ffffffff8815cd1c>] :xfs:xfs_trans_unlocked_item+0x2c/0x60
 [<ffffffff8816d460>] :xfs:xfs_file_aio_write+0x0/0x60
 [<ffffffff8816d4ba>] :xfs:xfs_file_aio_write+0x5a/0x60
 [<ffffffff802c9273>] do_sync_readv_writev+0xc3/0x110
 [<ffffffff8821c46f>] :exportfs:find_exported_dentry+0x8f/0x530
 [<ffffffff8029f390>] autoremove_wake_function+0x0/0x30
 [<ffffffff804ed14e>] sunrpc_cache_lookup+0x7e/0x160
 [<ffffffff802c910a>] rw_copy_check_uvector+0x8a/0x130
 [<ffffffff802c97a3>] do_readv_writev+0x113/0x230
 [<ffffffff8816cc94>] :xfs:xfs_fs_decode_fh+0xd4/0xe0
 [<ffffffff882232d3>] :nfsd:nfsd_vfs_write+0x113/0x350
 [<ffffffff8021f945>] __dentry_open+0x115/0x1e0
 [<ffffffff882239e7>] :nfsd:nfsd_open+0x157/0x1c0
 [<ffffffff88223cb9>] :nfsd:nfsd_write+0xd9/0x120
 [<ffffffff8822aec0>] :nfsd:nfsd3_proc_write+0x110/0x150
 [<ffffffff8821f25d>] :nfsd:nfsd_dispatch+0xfd/0x1f0
 [<ffffffff804e4c54>] svc_process+0x3e4/0x730
 [<ffffffff8026a432>] __down_read+0x12/0xa2
 [<ffffffff8821f720>] :nfsd:nfsd+0x0/0x2e0
 [<ffffffff8821f8c0>] :nfsd:nfsd+0x1a0/0x2e0
 [<ffffffff80263878>] child_rip+0xa/0x12
 [<ffffffff8821f720>] :nfsd:nfsd+0x0/0x2e0
 [<ffffffff8821f720>] :nfsd:nfsd+0x0/0x2e0
 [<ffffffff8026386e>] child_rip+0x0/0x12


The network graph shows a massive write at that time; I think that was the 
cause.

NFS mounted by the clients with options: 
rw,nosuid,nodev,noexec,vers=3,rsize=32768,wsize=32768,acregmin=1800,acregmax=3600,acdirmin=1800,acdirmax=3600,soft,intr,proto=udp,timeo=14,retrans=2,sec=sys

One of the nfsd processes is spinning at 99% CPU; the rest are in state D:
root      4817  3.1  0.0      0     0 ?        R    Apr29  78:46 [nfsd]
root      4818  1.0  0.0      0     0 ?        D    Apr29  25:28 [nfsd]
root      4819  1.0  0.0      0     0 ?        D    Apr29  25:40 [nfsd]
root      4820  1.0  0.0      0     0 ?        D    Apr29  25:31 [nfsd]
root      4821  1.0  0.0      0     0 ?        D    Apr29  25:34 [nfsd]
root      4822  1.0  0.0      0     0 ?        D    Apr29  25:40 [nfsd]
root      4823  1.0  0.0      0     0 ?        D    Apr29  25:32 [nfsd]
root      4824  1.0  0.0      0     0 ?        D    Apr29  25:36 [nfsd]

Steps to reproduce:
I don't know exactly; it happened on a production system, so I cannot 
experiment with it. I have downgraded it to 2.6.20.3 for now. I used default 
settings in /proc and /sys.
Comment 1 Pallai Roland 2007-05-01 04:23:11 UTC
Created attachment 11361 [details]
my .config
Comment 2 Pallai Roland 2007-05-01 15:40:37 UTC
I must correct myself, and I have some more info on this issue.

First, I did not upgrade from 2.6.20.3, I upgraded from 2.6.15.7. I have 3 NFS 
servers; 2 of them have been running 2.6.20.3 for a long time without problems. 
Yesterday I downgraded the third box (called Sulaco) from 2.6.21.1 to 2.6.20.3 
and believed the reported problem was gone, but I was wrong. The same error 
occurred with 2.6.20.3 on Sulaco after 1 day! This time I also got kernel 
messages that point to filesystem corruption (I think) just before the 
lockup.

May  1 23:10:45 Sulaco kernel: Filesystem "dm-6": Access to block zero in inode 2428580890 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: f0
May  1 23:10:45 Sulaco kernel: Filesystem "dm-6": Access to block zero in inode 2428580890 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: fd
May  1 23:10:45 Sulaco kernel: Filesystem "dm-6": Access to block zero in inode 2428580890 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: fe
May  1 23:10:45 Sulaco kernel: Filesystem "dm-6": Access to block zero in inode 2428580890 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: fe
May  1 23:10:45 Sulaco kernel: Filesystem "dm-6": Access to block zero in inode 2428580890 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: fe

and then:

May  1 23:10:55 Sulaco kernel: Call Trace:
May  1 23:10:55 Sulaco kernel:  <IRQ>  [softlockup_tick+233/272] softlockup_tick+0xe9/0x110
May  1 23:10:55 Sulaco kernel:  [update_process_times+80/144] update_process_times+0x50/0x90
May  1 23:10:55 Sulaco kernel:  [smp_local_timer_interrupt+52/96] smp_local_timer_interrupt+0x34/0x60
May  1 23:10:55 Sulaco kernel:  [smp_apic_timer_interrupt+89/128] smp_apic_timer_interrupt+0x59/0x80
May  1 23:10:55 Sulaco kernel:  [apic_timer_interrupt+102/112] apic_timer_interrupt+0x66/0x70
May  1 23:10:55 Sulaco kernel:  <EOI>  [_end+127398072/2129470264] :xfs:xfs_iext_get_ext+0x0/0x70
May  1 23:10:55 Sulaco kernel:  [_end+127274467/2129470264] :xfs:xfs_bmapi+0xdeb/0x1030
May  1 23:10:55 Sulaco kernel:  [_end+127551052/2129470264] :xfs:xfs_zero_eof+0xe4/0x190
May  1 23:10:55 Sulaco kernel:  [_end+127552484/2129470264] :xfs:xfs_write+0x4ec/0x9e0
May  1 23:10:55 Sulaco kernel:  [wake_idle+26/176] wake_idle+0x1a/0xb0
May  1 23:10:55 Sulaco kernel:  [activate_task+145/240] activate_task+0x91/0xf0
May  1 23:10:55 Sulaco kernel:  [__up_read+117/176] __up_read+0x75/0xb0
May  1 23:10:55 Sulaco kernel:  [_end+127471524/2129470264] :xfs:xfs_trans_unlocked_item+0x2c/0x60
May  1 23:10:55 Sulaco kernel:  [_end+127534520/2129470264] :xfs:xfs_file_aio_write+0x0/0x70
May  1 23:10:55 Sulaco kernel:  [_end+127534617/2129470264] :xfs:xfs_file_aio_write+0x61/0x70
May  1 23:10:55 Sulaco kernel:  [do_sync_readv_writev+195/272] do_sync_readv_writev+0xc3/0x110
May  1 23:10:55 Sulaco kernel:  [_end+128189601/2129470264] :exportfs:find_exported_dentry+0x89/0x500
May  1 23:10:55 Sulaco kernel:  [dev_queue_xmit+585/608] dev_queue_xmit+0x249/0x260
May  1 23:10:55 Sulaco kernel:  [autoremove_wake_function+0/48] autoremove_wake_function+0x0/0x30
May  1 23:10:55 Sulaco kernel:  [sunrpc_cache_lookup+126/336] sunrpc_cache_lookup+0x7e/0x150
May  1 23:10:55 Sulaco kernel:  [rw_copy_check_uvector+112/256] rw_copy_check_uvector+0x70/0x100
May  1 23:10:55 Sulaco kernel:  [do_readv_writev+229/464] do_readv_writev+0xe5/0x1d0
May  1 23:10:55 Sulaco kernel:  [_end+127533722/2129470264] :xfs:xfs_fs_decode_fh+0xe2/0xf0
May  1 23:10:55 Sulaco kernel:  [_end+128239116/2129470264] :nfsd:nfsd_setuser+0x154/0x1c0
May  1 23:10:55 Sulaco kernel:  [_end+128213299/2129470264] :nfsd:fh_verify+0x50b/0x530
May  1 23:10:55 Sulaco kernel:  [_end+128220538/2129470264] :nfsd:nfsd_vfs_write+0xe2/0x2d0


When I rebooted the machine, I've got a corrupted filesystem! :(

attempt to access beyond end of device
dm-6: rw=1, want=0, limit=2097152000
Buffer I/O error on device dm-6, logical block 18446744073709551615
lost page write due to I/O error on dm-6
XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1590 of file 
fs/xfs/xfs_alloc.c.  Caller 0xffffffff8810b347

Call Trace:
 [<ffffffff8810a457>] :xfs:xfs_free_ag_extent+0x4b7/0x690
 [<ffffffff8810b347>] :xfs:xfs_free_extent+0xa7/0xd0
 [<ffffffff8811a4bb>] :xfs:xfs_bmap_finish+0xeb/0x170
 [<ffffffff88137e8d>] :xfs:xfs_itruncate_finish+0x1ad/0x2d0
 [<ffffffff8815245e>] :xfs:xfs_inactive+0x24e/0x480
 [<ffffffff88160695>] :xfs:xfs_fs_clear_inode+0xa5/0xf0
 [<ffffffff80291f9a>] clear_inode+0xca/0x130
 [<ffffffff80292f7f>] generic_delete_inode+0xdf/0x140
 [<ffffffff802931cd>] iput+0x6d/0x70
 [<ffffffff8028fc4f>] prune_one_dentry+0x6f/0xb0
 [<ffffffff8028fdd4>] prune_dcache+0x144/0x1b0
 [<ffffffff802903e9>] shrink_dcache_memory+0x19/0x50
 [<ffffffff8025f6a7>] shrink_slab+0x107/0x180
 [<ffffffff80260c72>] balance_pgdat+0x242/0x380
 [<ffffffff80243680>] keventd_create_kthread+0x0/0x90
 [<ffffffff80243680>] keventd_create_kthread+0x0/0x90
 [<ffffffff80260ecd>] kswapd+0x11d/0x120
 [<ffffffff80243c10>] autoremove_wake_function+0x0/0x30
 [<ffffffff80243c10>] autoremove_wake_function+0x0/0x30
 [<ffffffff80260db0>] kswapd+0x0/0x120
 [<ffffffff80260db0>] kswapd+0x0/0x120
 [<ffffffff80243642>] kthread+0x92/0xd0
 [<ffffffff8020a5b8>] child_rip+0xa/0x12
 [<ffffffff80243680>] keventd_create_kthread+0x0/0x90
 [<ffffffff802435b0>] kthread+0x0/0xd0
 [<ffffffff8020a5ae>] child_rip+0x0/0x12

xfs_force_shutdown(dm-6,0x8) called from line 4272 of file fs/xfs/xfs_bmap.c.  
Return address = 0xffffffff88161368
Filesystem "dm-6": Corruption of in-memory data detected.  Shutting down 
filesystem: dm-6
Please umount the filesystem, and rectify the problem(s)


What is going on here? I consider 2.6.20.3 stable for this job, as I have been 
using it on the other 2 NFS server boxes for a long time and their setup is 
very similar. Maybe a hardware problem? Memory failure?? But then why are the 
call traces so similar?
Comment 3 Pallai Roland 2007-05-01 16:32:52 UTC
It's not a lucky day for me; I could not mount the corrupted fs:

Starting XFS recovery on filesystem: dm-6 (logdev: internal)
XFS: failed to read root inode

I ran xfs_repair and, in practice, I lost the partition with 1Tb of data. 
Bleh. I have now switched every partition on this box to read-only, but I 
really want to know what is going on here. TIA.
Comment 4 Dave Chinner 2007-05-02 06:50:25 UTC
The original problem - the soft lockup - is probably a result      
of a massively fragmented file as we are searching the extent      
list when the soft-lockup detector fired. The soft lockup      
detector is not indicative of an actual bug being present, though.     
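
If you want a quick read-only way to gauge that, something like the following
should show the overall fragmentation factor and the extent count of a suspect
file (the device and file paths below are only examples; substitute your own):

# xfs_db -r -c frag /dev/dm-6
# xfs_bmap -v /path/to/suspect/file | wc -l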
     
The second problem you report (access to block zero) indicates
something did go wrong and there is on-disk corruption of an
extent tree - there are zeros instead of real block numbers.
You need to run repair to fix this.
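
For reference, a read-only look at the extent list of the inode named in those
messages would be something along these lines (the device path is again just
an example):

# xfs_db -r -c "inode 2428580890" -c "print core.nextents" -c bmap /dev/dm-6

Running xfs_repair in no-modify mode ("xfs_repair -n <dev>") will also report
what it thinks is broken without changing anything on disk.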
 
> attempt to access beyond end of device  
> dm-6: rw=1, want=0, limit=2097152000  
> Buffer I/O error on device dm-6, logical block 18446744073709551615  
> lost page write due to I/O error on dm-6  
  
This shows an attempt to write to a block marked as either a hole or delayed
allocation. You're not having memory errors, are you?
     
Now, the mount error:   
   
    982         error = xfs_iget(mp, NULL, sbp->sb_rootino, 0, XFS_ILOCK_EXCL, &rip, 0);
    983         if (error) {
    984                 cmn_err(CE_WARN, "XFS: failed to read root inode");
    985                 goto error3;
    986         }
   
This means the root inode recorded in the superblock is wrong. Something went 
really wrong here. Can you run:
 
# xfs_db -r -c "sb 0" -c p <dev>  
 
and 
 
# dd if=<dev> bs=512 count=1 | od -Ax 
 
And attach the output so we can see how badly corrupted the superblock 
is? 
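
As a sanity check, the first four bytes of a healthy primary superblock are
the magic number "XFSB" (58 46 53 42 in hex), so something like

# dd if=<dev> bs=512 count=1 2>/dev/null | od -Ax -tx1 | head -4

should start with 58 46 53 42 at offset 000000 if the superblock header itself
is still intact.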
 
And finally, repair. What problem did you encounter? Did you just end 
up with everything in lost+found? 
 
Hmmm - this reminds me of problems seen when the filesystem wraps the 
device at 2TB. How big is the LVM volume this partition is on, and where is 
it placed? Along the same train of thought, I'm wondering 
if the device wasn't reconstructed correctly with the new kernel and that 
led to the problem. Were there any other error messages or warnings in the 
syslog from the boot when the problem first happened? 
 
 
      
Comment 5 Pallai Roland 2007-05-02 09:20:01 UTC
Thank you for the reply!

> The original problem - the soft lockup - is probably a result
> of a massively fragmented file as we are searching the extent
> list when the soft-lockup detector fired. The soft lockup
> detector is not indicative of an actual bug being present, though.
I waited about 10 minutes after the soft lockup, but one of the nfsd 
processes consumed 99% CPU the whole time until I rebooted into 2.6.20.3. 
nfsd did not reply to any request after the soft lockup. I think it was a 
real infinite loop/deadlock/I don't know.

>> attempt to access beyond end of device  
>> dm-6: rw=1, want=0, limit=2097152000  
>> Buffer I/O error on device dm-6, logical block 18446744073709551615  
>> lost page write due to I/O error on dm-6  
> This shows an attempt to write to a block marked as either a hole or delayed
> allocation. You're not having memory errors, are you?
I ran I/O stress test scripts today, but there is no sign of memory/fs 
corruption or instability. I would like to run memtest86+, but as I said it's 
a production server and there are still 4Tb of data on the remaining 
partitions being served read-only, so I cannot easily shut down this box for a 
long memtest run. Anyway, I have no reason to suspect bad memory (except that 
I just lost 1Tb of data :)

> Can you run: 
> # xfs_db -r -c "sb 0" -c p <dev> 
> and
> # dd if=<dev> bs=512 count=1 | od -Ax 
> And attach the output so we can see how badly corrupted the superblock 
> is? 
I know; I made a mistake when I ran 'xfs_repair -L' without thinking. A lot 
of valuable debug info has been lost. :(  The superblock dump is attached, but 
it's the repaired one, not the corrupted one.

> And finally, repair. What problem did you encounter? Did you just end 
> up with everything in lost+found? 
Yes, and I've also got empty directories that weren't on the original 
partition, I'm sure. But 'du -hs lost+found' says everything is there.

> Hmmm - this reminds me of problems seen when the filesystem wraps the 
> device at 2TB. How big is the LVM volume this partition is on and where is 
> it placed?

Sulaco:~# vgdisplay -v vg1
  [...]
  Metadata Areas        1
  Metadata Sequence No  5
  [...]
  Total PE              89426
  Alloc PE / Size       89426 / 5.46 TB
  Free  PE / Size       0 / 0

Sulaco:~# lvdisplay -m /dev/vg1/wlv
  [...]
  LV Size                1000.00 GB
  Current LE             16000
  Segments               1
  [...]
  --- Segments ---
  Logical extent 0 to 15999:
    Type                linear
    Physical volume     /dev/md0
    Physical extents    32000 to 47999

Sulaco:~# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4]
md0 : active raid5 sda1[0] sdg1[15] sdh1[14] sdf1[13] sde1[12] sdp1[11] sdo1
[10] sdn1[9] sdm1[8] sdl1[7] sdk1[6] sdj1[5] sdi1[4] sdd1[3] sdc1[2] sdb1[1]
      5860631040 blocks level 5, 128k chunk, algorithm 2 [16/16] 
[UUUUUUUUUUUUUUUU]
      [=============>.......]  resync = 66.2% (258980992/390708736) 
finish=1451.1min speed=1511K/sec

It's placed in the middle of the VG and is only 1Tb. There's a 2.5Tb partition 
on this box too with no problems; I checked all volumes with xfs_check today.

> Along the same train of thought, I'm wondering if the device wasn't
> reconstructed correctly with the new kernel and that led to the problem.
> Were there any other error messages or warnings in the syslog from the
> boot when the problem first happened? 
No, nothing. The MD array was complete and had been working for 1 day in the 
first case, and there was no rebuild after the reboot.


I've been running a stress test like this over NFS since this morning:

while :; do
  for i in `seq 1 100`; do
    rm $i
    dd if=/dev/zero of=$i bs=32k count=800 &
  done
  wait  # let the background dd processes finish before verifying
  for i in `seq 1 100`; do
    if ! md5sum $i | grep -q <md5>; then
       # has not happened yet
       echo FATAL; exit 1
    fi
  done
done

No errors yet. It's not "that" partition, but it has the same size and file 
count. After I've mined lost+found for valuable data, I'll start this test on 
that partition too.

I don't know what the problem is, but the similar call traces on soft lockup 
are very suspicious to me. I have never seen such messages on my servers 
before, and generally there is no heavy fragmentation (checked with xfs_bmap) 
and the load is not too high.
Comment 6 Pallai Roland 2007-05-02 09:21:26 UTC
Created attachment 11374 [details]
superblock
Comment 7 Pallai Roland 2007-05-02 09:22:03 UTC
Created attachment 11375 [details]
superblock by xfs_db
Comment 8 Natalie Protasevich 2007-07-07 00:09:08 UTC
Any update on the problem?
Thanks.
Comment 9 Pallai Roland 2007-09-03 03:59:44 UTC
I've been using 2.6.20 since this issue; I won't use .21 on this box again. I'll jump to .22 this week.
Comment 10 Natalie Protasevich 2007-11-07 23:05:56 UTC
Pallai, did you have a chance to try the latest kernel?
Comment 11 Natalie Protasevich 2008-02-02 02:02:05 UTC
Since there has been no update on this problem for a while now, I'm closing it.
Please reopen if you get back to it.
