Bug 5159 - BUG: soft lockup detected on CPU#0!
Product: IO/Storage
Classification: Unclassified
Component: MD
Hardware: i386 Linux
Importance: P2 normal
Assigned To: io_md
Depends on:
Reported: 2005-08-30 12:46 UTC by David Gardner
Modified: 2007-05-17 15:05 UTC
CC List: 5 users

See Also:
Kernel Version: 2.6.13-rc6-mm1
Tree: Mainline
Regression: ---

Attachments:
dmesg output (17.33 KB, text/plain)
2005-08-30 12:47 UTC, David Gardner
/var/log/messages (56.80 KB, text/plain)
2005-08-30 12:49 UTC, David Gardner

Description David Gardner 2005-08-30 12:46:29 UTC
Most recent kernel where this bug did not occur: unsure
Distribution: SLES 9
Hardware Environment: IBM xSeries 305
Software Environment: SLES 9 w/ Linux version 2.6.13-rc6-mm1
Problem Description:
After running tests for a couple of days I got this:

BUG: soft lockup detected on CPU#0!

Pid: 3545, comm:            md0_raid1
EIP: 0060:[<c027ddb3>] CPU: 0
EIP is at ide_intr+0xc3/0x100
 EFLAGS: 00000246    Not tainted  (2.6.13-rc6-mm1)
EAX: 00000001 EBX: dfc1a980 ECX: 06b3b555 EDX: 00000007
ESI: c0455b88 EDI: 00000246 EBP: c0455af8 DS: 007b ES: 007b
CR0: 8005003b CR2: b79850d8 CR3: 04b58000 CR4: 000006d0
 [<c0282550>] task_out_intr+0x0/0xd0
 [<c013a492>] handle_IRQ_event+0x32/0x70
 [<c013a525>] __do_IRQ+0x55/0xa0
 [<c0105128>] do_IRQ+0x38/0x60
 [<c0103c5a>] common_interrupt+0x1a/0x20
 [<c0147730>] __blk_queue_bounce+0x210/0x280
 [<c02562fd>] __make_request+0x5d/0x4a0
 [<c0256a4e>] generic_make_request+0x12e/0x200
 [<c012d2d0>] autoremove_wake_function+0x0/0x30
 [<c013a55c>] __do_IRQ+0x8c/0xa0
 [<c013a543>] __do_IRQ+0x73/0xa0
 [<c012d2d0>] autoremove_wake_function+0x0/0x30
 [<c0103c5a>] common_interrupt+0x1a/0x20
 [<f883cba3>] raid1d+0x103/0x230 [raid1]
 [<c0316552>] schedule_timeout+0x92/0xa0
 [<c028f8cc>] md_thread+0x11c/0x160
 [<c012d2d0>] autoremove_wake_function+0x0/0x30
 [<c0102bd6>] ret_from_fork+0x6/0x20
 [<c012d2d0>] autoremove_wake_function+0x0/0x30
 [<c028f7b0>] md_thread+0x0/0x160
 [<c0101319>] kernel_thread_helper+0x5/0xc

Steps to reproduce:
The tests that I ran were iozone from the STP testsuite and fsstress from the
LTP testsuite. I ran them concurrently on the same filesystem - /testfs.
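
(For anyone trying to reproduce: a rough sketch of a concurrent run along
these lines, assuming /testfs is the md0-backed mount shown below; the
iozone and fsstress flags here are illustrative guesses, not the exact STP
or LTP invocations used for this report.)

  # Run both stressors against the same filesystem at once.
  # All flags below are assumptions for illustration.
  iozone -a -g 2g -f /testfs/iozone.tmp > /tmp/iozone.log 2>&1 &
  fsstress -d /testfs/fsstress -p 4 -n 10000 -l 0 > /tmp/fsstress.log 2>&1 &
  wait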

Filesystem    Type   1K-blocks      Used Available Use% Mounted on
/dev/hda7 reiserfs    21229172   3031088  18198084  15% /
tmpfs        tmpfs      647900         8    647892   1% /dev/shm
               xfs    20961280    120680  20840600   1% /results
/dev/md0      ext2    11361308         8  10784172   1% /testfs
              ext3    11594976   5219340   5786632  48% /testfs2
/dev/hda5      jfs    11509344      1540  11507804   1% /testfs3
          reiserfs     8388348     32840   8355508   1% /testfs4
/dev/hda2 reiserfs     3148632    248640   2899992   8% /tests
/dev/hda6     ext2     1035660       180    982872   1% /tests/bash-memory

This is a software RAID device. If this bug belongs in a different area, please
let me know where it belongs.
Comment 1 David Gardner 2005-08-30 12:47:12 UTC
Created attachment 5820 [details]
dmesg output
Comment 2 David Gardner 2005-08-30 12:49:02 UTC
Created attachment 5821 [details]
/var/log/messages
Comment 3 David Gardner 2005-10-11 11:22:23 UTC
This problem is still occurring on 2.6.14-rc2-mm1.
Comment 4 Andrew Morton 2006-01-16 01:35:48 UTC
Methinks this is an IDE crash.
Comment 5 Bartlomiej Zolnierkiewicz 2006-01-16 02:42:45 UTC
PIO transfers are slow; does it happen with the NMI watchdog disabled?
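
(A sketch of one way to test that, assuming an i386 2.6 kernel and a
standard bootloader; the verification step is just a suggestion.)

  # Append nmi_watchdog=0 to the kernel command line in the bootloader
  # config (e.g. GRUB's menu.lst), reboot, then confirm NMIs stopped:
  grep NMI /proc/interrupts
  sleep 10
  grep NMI /proc/interrupts   # the count should no longer be climbing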
Comment 6 Bartlomiej Zolnierkiewicz 2006-01-16 02:51:10 UTC
Some more analysis:
- DMA timed out on hdc, the transfer mode was switched to PIO, and this
  (falsely?) triggered the NMI watchdog
- hda and hdc seem to be identical drives (but hda has its capacity clipped
  to 40GB), which is quite interesting and may help in debugging the problem
  [ I'll audit the serverworks.c driver later ]

Is the DMA timeout bug reproducible?  Is it always hdc?
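
(One way to help answer that, checking whether hdc has actually dropped
back to PIO after the timeout; hdparm and /proc/ide are standard on 2.6
kernels, and the grep pattern is just a suggestion.)

  # Is DMA still enabled on hdc after the timeout?
  hdparm -d /dev/hdc                      # "using_dma = 0" means PIO fallback
  grep using_dma /proc/ide/hdc/settings
  # Look for the timeout/fallback messages in the kernel log:
  dmesg | grep -iE 'hdc.*(dma|timeout)'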
Comment 7 David Gardner 2006-03-21 13:18:31 UTC
No longer reproducible.
Comment 8 Mark Barrett 2006-10-25 09:37:10 UTC
I've just hit this bug. Here's my output from dmesg:

BUG: soft lockup detected on CPU#0!

Pid: 27122, comm:            ocssd.bin
EIP: 0061:[<c0153738>] CPU: 0
EIP is at follow_page+0x128/0x1c0
 EFLAGS: 00000246    Not tainted  (2.6.16-xen #1)
EAX: 00000000 EBX: 004110a0 ECX: 55555555 EDX: e0c61784
ESI: 00000000 EDI: 0000000e EBP: b4d3a000 DS: 007b ES: 007b
CR0: 8005003b CR2: b5c400f8 CR3: 04f03000 CR4: 00000640
 [<c0153900>] get_user_pages+0x130/0x340
 [<c019aeef>] elf_core_dump+0x99f/0xb29
 [<c0176ee0>] do_coredump+0x250/0x2d4
 [<c012bdf1>] __dequeue_signal+0x71/0xa0
 [<c012bea9>] dequeue_signal+0x89/0x110
 [<c012e053>] get_signal_to_deliver+0x313/0x3a0
 [<c0104dc1>] do_signal+0x71/0x170
 [<c012cc74>] kill_proc_info+0x54/0x80
 [<c012e16d>] sigprocmask+0x5d/0x120
 [<c012e2ba>] sys_rt_sigprocmask+0x8a/0x140
 [<c0104ef8>] do_notify_resume+0x38/0x3c
 [<c0105031>] work_notifysig+0x13/0x1a

I've been running this machine without problems for about three weeks.
The kernel was compiled manually as part of a Xen install (this happened on
a DomU guest).
Comment 9 Neil Brown 2006-10-29 19:27:01 UTC
While I agree there are similarities, this is really a totally different bug.
I recommend you open a new bug report with this information so that
it gets assigned properly.
Comment 10 Dave Jones 2007-05-17 15:05:12 UTC
closing based on comment #7
