Bug 5210

Summary: crash under disk stress [+4kstacks], Unable to handle kernel paging request at virtual address 08080880 @ do_page_fault
Product: File System
Reporter: peter gervai (grin)
Component: XFS
Assignee: Diego Calleja (diegocg)
Status: CLOSED CODE_FIX
Severity: normal
CC: alexn, bunk
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.13
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: Serial console capture, .config, Oops with requested flag

Description peter gervai 2005-09-09 06:58:41 UTC
[please reassign to the correct component if my guess was not right. might be
ram, disk or xfs related, among other things.]

Problem Description:
Oops: 0000 [#1]
Modules linked in: ipt_LOG iptable_filter ip_tables bttv video_buf
firmware_class i2c_algo_bit v4l2_common btcx_risc tveeprom videodev i
pip ide_cd cdrom uhci_hcd sd_mod sata_sil libata scsi_mod ehci_hcd usbcore
snd_intel8x0 snd_ac97_codec sis_agp agpgart isofs it87 i2c_sensor
i2c_isa snd_pcm_oss snd_pcm snd_timer snd_page_alloc snd_mixer_oss snd
soundcore tun hisax isdn slhc psmouse i2c_viapro i2c_core ipv6
8139too sundance mii crc32
CPU:    0
EIP:    0060:[<c0115b0b>]    Not tainted VLI
EFLAGS: 00010002   (2.6.13) 
EIP is at do_page_fault+0xcb/0x6dd
eax: c8b4f000   ebx: 0000000b   ecx: 0000000d   edx: 07070707
esi: 0000000e   edi: c040e2d8   ebp: c8b4f1dc   esp: c8b4f10c
ds: 007b   es: 007b   ss: 0068
Unable to handle kernel paging request at virtual address 08080880
[....]

Lots of oopses (attached to the report) when the disk is stressed (bonnie++
usually freezes at the "writing intelligently" phase). memtest86+ ran for hours
without error. The machine freezes with or without swap. The disks are believed
to be good (smart, badblocks), the CPU is not overclocked, and its temperature
is reasonable.

I cannot point to any hardware fault (not being familiar with kernel internals
this deep), but one is not impossible.

Bonnie++ can crash it almost anytime. It did under 2.6.8.1 and 2.6.12.*, and
does under 2.6.13 too.

Distribution: kernel.org
Hardware Environment:
Intel(R) Celeron(R) CPU 2.00GHz, 512M RAM

# lspci
0000:00:00.0 Host bridge: Silicon Integrated Systems [SiS] SiS645DX Host &
Memory & AGP Controller
0000:00:01.0 PCI bridge: Silicon Integrated Systems [SiS] Virtual PCI-to-PCI
bridge (AGP)
0000:00:02.0 ISA bridge: Silicon Integrated Systems [SiS] SiS962 [MuTIOL Media
IO] (rev 04)
0000:00:02.1 SMBus: Silicon Integrated Systems [SiS]: Unknown device 0016
0000:00:02.5 IDE interface: Silicon Integrated Systems [SiS] 5513 [IDE]
0000:00:02.7 Multimedia audio controller: Silicon Integrated Systems [SiS] Sound
Controller (rev a0)
0000:00:03.0 USB Controller: Silicon Integrated Systems [SiS] USB 1.0 Controller
(rev 0f)
0000:00:03.1 USB Controller: Silicon Integrated Systems [SiS] USB 1.0 Controller
(rev 0f)
0000:00:03.2 USB Controller: Silicon Integrated Systems [SiS] USB 1.0 Controller
(rev 0f)
0000:00:03.3 USB Controller: Silicon Integrated Systems [SiS] USB 2.0 Controller
0000:00:08.0 Network controller: Eicon Networks Corporation Diva 2.01 S/T PCI
(rev 01)
0000:00:09.0 RAID bus controller: Silicon Image, Inc. (formerly CMD Technology
Inc) SiI 3112 [SATALink/SATARaid] Serial ATA Controller (rev 02)
0000:00:0a.0 VGA compatible controller: S3 Inc. ViRGE/DX or /GX (rev 01)
0000:00:0c.0 Ethernet controller: D-Link System Inc DL10050 Sundance Ethernet

# mount
/dev/hda7 on / type xfs (rw,usrquota)
proc on /proc type proc (rw,gid=104)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/hdc2 on /var/lib/backuppc type ext3 (rw)
none on /dev type tmpfs (rw,size=5M,mode=0755)
none on /proc/bus/usb type usbfs (rw)
/dev/mapper/vg0-backup on /mnt/lvm/backup type xfs (rw)
/dev/mapper/vg0-db on /mnt/lvm/db type xfs (rw)

Software Environment: Debian GNU/Linux unstable (SID)

Steps to reproduce:
start bonnie++ on / [EIDE] or ..vg0-db/ [SATA], wait, crash.
Comment 1 peter gervai 2005-09-09 07:02:22 UTC
Created attachment 5947
Serial console capture
Comment 2 Alexander Nyberg 2005-09-10 05:50:57 UTC
Hmpf, are you using CONFIG_4KSTACKS?

Could you pass over your .config please
Comment 3 peter gervai 2005-09-12 01:21:58 UTC
Created attachment 5974
.config

Yes, this machine happens to use 4Kstacks. Config attached.
(I'll try to crash without 4kstacks, reporting back a bit later.)
Comment 4 peter gervai 2005-09-12 03:23:02 UTC
I switched off 4kstacks.

I cannot freeze the machine anymore. :-/

Okay, nice hint, thank you very much; seems to be fixed (I go and switch off
4kstacks everywhere right now).
I still wonder why, but that's just out of curiosity...
Comment 5 Alexander Nyberg 2005-09-12 03:35:17 UTC
It's probably the combination of xfs and device-mapper plus sata (I don't know
how stack-hungry sata is, but it does use the scsi layer) using up the 4K
stack. This is unfortunate, but I'm sure there are people who are interested in
this.

What would be very nice is if you could turn on CONFIG_4KSTACKS and turn on
"Stack utilization instrumentation" under "Kernel hacking". That should give us
a trace of what makes the stack overflow (trace will be on the console).
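
For reference, a rough sketch of the .config fragment this request corresponds
to. It assumes the 2.6.13 i386 symbol names: CONFIG_DEBUG_STACK_USAGE is taken
to be the symbol behind "Stack utilization instrumentation", and
CONFIG_DEBUG_KERNEL is assumed to be required for the "Kernel hacking" options
to be selectable. Check the result against your own tree with
"make menuconfig" rather than trusting these names blindly.

```
# Assumed symbol names for 2.6.13 on i386; verify in "Kernel hacking"
CONFIG_DEBUG_KERNEL=y
# 4 KB kernel stacks instead of 8 KB
CONFIG_4KSTACKS=y
# "Stack utilization instrumentation"
CONFIG_DEBUG_STACK_USAGE=y
```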

Either that, or describe your environment so that someone else can reproduce the
problem.

Thanks
Comment 6 peter gervai 2005-09-12 05:11:00 UTC
I'll try to get a stack util dump while I'm in the crashing mood. (It is fun to
do on a remote server *wink* *wink* [makes operators finally work for their money].)

BTW crash happened without lvm and sata, too. Maybe it's just harder to
reproduce, I don't know, since I didn't try bonnie++ until I got the new sata
drive, and freezes were intermittent. 
Comment 7 peter gervai 2005-09-12 07:32:13 UTC
Created attachment 5979
Oops with requested flag

Well, I hope it's helpful, stack trace seems to shoot itself in the foot... I
can switch on any debugging flag (as long as I don't need further magic than
the serial console) if it'd help...
Comment 8 Diego Calleja 2006-07-30 10:23:25 UTC
As far as I'm concerned, by now XFS has solved its stack issues and should be
safe to use with CONFIG_4KSTACKS, so I'm closing this bug. If you can reproduce
this problem with the latest stable kernel version, please reopen it.