Most recent kernel where this bug did not occur: unsure
Distribution: SLES 9
Hardware Environment: xSeries x305
Software Environment: SLES 9 w/ Linux version 2.6.13-rc6-mm1

Problem Description:
After running tests for a couple of days I got this:

BUG: soft lockup detected on CPU#0!
Pid: 3545, comm: md0_raid1
EIP: 0060:[<c027ddb3>] CPU: 0
EIP is at ide_intr+0xc3/0x100
EFLAGS: 00000246 Not tainted (2.6.13-rc6-mm1)
EAX: 00000001 EBX: dfc1a980 ECX: 06b3b555 EDX: 00000007
ESI: c0455b88 EDI: 00000246 EBP: c0455af8 DS: 007b ES: 007b
CR0: 8005003b CR2: b79850d8 CR3: 04b58000 CR4: 000006d0
 [<c0282550>] task_out_intr+0x0/0xd0
 [<c013a492>] handle_IRQ_event+0x32/0x70
 [<c013a525>] __do_IRQ+0x55/0xa0
 [<c0105128>] do_IRQ+0x38/0x60
 [<c0103c5a>] common_interrupt+0x1a/0x20
 [<c0147730>] __blk_queue_bounce+0x210/0x280
 [<c02562fd>] __make_request+0x5d/0x4a0
 [<c0256a4e>] generic_make_request+0x12e/0x200
 [<c012d2d0>] autoremove_wake_function+0x0/0x30
 [<c013a55c>] __do_IRQ+0x8c/0xa0
 [<c013a543>] __do_IRQ+0x73/0xa0
 [<c012d2d0>] autoremove_wake_function+0x0/0x30
 [<c0103c5a>] common_interrupt+0x1a/0x20
 [<f883cba3>] raid1d+0x103/0x230 [raid1]
 [<c0316552>] schedule_timeout+0x92/0xa0
 [<c028f8cc>] md_thread+0x11c/0x160
 [<c012d2d0>] autoremove_wake_function+0x0/0x30
 [<c0102bd6>] ret_from_fork+0x6/0x20
 [<c012d2d0>] autoremove_wake_function+0x0/0x30
 [<c028f7b0>] md_thread+0x0/0x160
 [<c0101319>] kernel_thread_helper+0x5/0xc

Steps to reproduce:
The tests that I ran were iozone from the STP test suite and fsstress from the LTP test suite. I ran them concurrently on the same filesystem, /testfs.
Filesystem                    Type      1K-blocks     Used  Available Use% Mounted on
/dev/hda7                     reiserfs   21229172  3031088   18198084  15% /
tmpfs                         tmpfs        647900        8     647892   1% /dev/shm
/dev/mapper/system-resultslv  xfs        20961280   120680   20840600   1% /results
/dev/md0                      ext2       11361308        8   10784172   1% /testfs
/dev/mapper/system-testfs2    ext3       11594976  5219340    5786632  48% /testfs2
/dev/hda5                     jfs        11509344     1540   11507804   1% /testfs3
/dev/mapper/system-testfs4lv  reiserfs    8388348    32840    8355508   1% /testfs4
/dev/hda2                     reiserfs    3148632   248640    2899992   8% /tests
/dev/hda6                     ext2        1035660      180     982872   1% /tests/bash-memory

/dev/md0 (mounted on /testfs) is a software RAID device. If this bug belongs in a different area, please let me know where it belongs.
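A minimal sketch of the reproduction workload described above. The report does not state which iozone/fsstress options were used, so the flags below are illustrative assumptions only:

```shell
#!/bin/sh
# Hypothetical reproduction sketch: run iozone (STP) and fsstress (LTP)
# concurrently on the same filesystem, as the reporter describes.
# All flags are assumptions -- the report does not give the exact ones.
MNT=/testfs
iozone -a -f "$MNT/iozone.tmp" > iozone.log 2>&1 &
fsstress -d "$MNT/fsstress" -p 4 -n 10000 -l 0 > fsstress.log 2>&1 &
wait   # the lockup reportedly appeared after a couple of days of this load
```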
Created attachment 5820 [details] dmesg output
Created attachment 5821 [details] /var/log/messages
This problem is still occurring on 2.6.14-rc2-mm1.
Methinks this is an IDE crash.
PIO transfers are slow; does it happen with the NMI watchdog disabled?
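For reference, the usual way to disable the NMI watchdog on 2.6-era x86 kernels is a boot-time parameter (a generic config fragment, not something stated in this report):

```
# Append to the kernel command line in the boot loader (GRUB/LILO):
nmi_watchdog=0
```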
Some more analysis:
- DMA timed out on hdc, the transfer mode was switched to PIO, and that (falsely?) triggered the NMI watchdog.
- hda and hdc seem to be identical drives (but hda has its capacity clipped to 40GB), so this is quite interesting and may help in debugging the problem.

[I'll audit the serverworks.c driver later.]

Is the DMA timeout bug reproducible? Is it always hdc?
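One way to answer the reproducibility question is to pull the relevant IDE error lines out of the attached logs. The message patterns below are assumptions based on typical 2.6 ide-core output, not text quoted from this report:

```shell
# Look for DMA-timeout / PIO-fallback messages for either drive.
# 'hd[ac]' matches hda and hdc; adjust the path to the attached log file.
grep -iE 'hd[ac].*(dma|timeout|pio)' /var/log/messages
```

If the fallback recurs, the timestamps on the matching lines show whether it is always hdc.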
No longer reproducible.
I've got this bug just now. Here's my output from dmesg:

BUG: soft lockup detected on CPU#0!
Pid: 27122, comm: ocssd.bin
EIP: 0061:[<c0153738>] CPU: 0
EIP is at follow_page+0x128/0x1c0
EFLAGS: 00000246 Not tainted (2.6.16-xen #1)
EAX: 00000000 EBX: 004110a0 ECX: 55555555 EDX: e0c61784
ESI: 00000000 EDI: 0000000e EBP: b4d3a000 DS: 007b ES: 007b
CR0: 8005003b CR2: b5c400f8 CR3: 04f03000 CR4: 00000640
 [<c0153900>] get_user_pages+0x130/0x340
 [<c019aeef>] elf_core_dump+0x99f/0xb29
 [<c0176ee0>] do_coredump+0x250/0x2d4
 [<c012bdf1>] __dequeue_signal+0x71/0xa0
 [<c012bea9>] dequeue_signal+0x89/0x110
 [<c012e053>] get_signal_to_deliver+0x313/0x3a0
 [<c0104dc1>] do_signal+0x71/0x170
 [<c012cc74>] kill_proc_info+0x54/0x80
 [<c012e16d>] sigprocmask+0x5d/0x120
 [<c012e2ba>] sys_rt_sigprocmask+0x8a/0x140
 [<c0104ef8>] do_notify_resume+0x38/0x3c
 [<c0105031>] work_notifysig+0x13/0x1a

I've been running this machine without problems for about three weeks. The kernel was compiled manually as part of a Xen install (this happened on a DomU host).
While I agree there are similarities, this is really a totally different bug. I recommend you open a new bug report with this information so that it gets assigned properly.
closing based on comment #7