Bug 6524 - s390x System hangs/loops after displaying Machine Check Handler Diagnostic Message on HMC
Status: REJECTED DOCUMENTED
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Other
Hardware: i386 Linux
Importance: P2 high
Assignee: Peter Oberparleiter
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-05-09 10:54 UTC by Jim Suhr
Modified: 2006-08-24 01:11 UTC

See Also:
Kernel Version: 2.6.16.15
Subsystem:
Regression: ---
Bisected commit-id:



Description Jim Suhr 2006-05-09 10:54:17 UTC
Most recent kernel where this bug did not occur: unknown
Distribution: CentOS 4.2 S390x +
Hardware Environment: IBM Z900 model 104
Software Environment: Linux running in LPAR mode (no VM)
Problem Description: System hangs with the following message on the hardware console:
05:32:36 localhost kernel: crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=B, anc=0, erc=0, rsid=
Kernel level is:
localhost kernel: Linux version 2.6.9-34.EL (builder@c4s390x) (gcc version 3.4.5 20051201 (Red Hat 3.4.5-2)) 0000001 SMP
The rsc=B indicates that the interrupt is from the channel subsystem.
RMF indicates the LPAR is using 100% CPU.
There are no indications of a hardware error on any of the MVS LPARs.
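
For anyone triaging similar console output, the fields of the crw_info line can be pulled apart mechanically. The following is only a sketch: the function names and the regular expression are made up for illustration, and the reporting-source-code table is an assumption based on the CRW_RSC_* constants in the s390 kernel headers (where 0xB is the channel subsystem, matching the reading above). Values are parsed as hex because rsc is evidently printed that way; the single-bit flags parse the same either way.

```python
import re

# Reporting-source codes (rsc) of a channel report word. Assumed from the
# CRW_RSC_* constants in the s390 kernel headers; verify against the
# headers of the kernel actually in use.
RSC_NAMES = {
    0x2: "monitoring facility",
    0x3: "subchannel",
    0x4: "channel path",
    0x9: "configuration-alert facility",
    0xB: "channel subsystem",
}

# Matches key=value pairs such as "rsc=B"; the value may be empty.
FIELD_RE = re.compile(r"(\w+)=([0-9A-Fa-f]*)")


def parse_crw_info(line):
    """Return the key=value fields of a crw_info log line.

    Values are converted from hex to int. An empty value (as in the
    truncated "rsid=" on the console) is returned as None.
    """
    fields = {}
    for key, value in FIELD_RE.findall(line):
        fields[key] = int(value, 16) if value else None
    return fields


def describe_rsc(fields):
    """Map the rsc field to a readable name, or "unknown"."""
    return RSC_NAMES.get(fields.get("rsc"), "unknown")


if __name__ == "__main__":
    sample = ("05:32:36 localhost kernel: crw_info : CRW reports slct=0, "
              "oflw=0, chn=0, rsc=B, anc=0, erc=0, rsid=")
    crw = parse_crw_info(sample)
    print("rsc=%X (%s)" % (crw["rsc"], describe_rsc(crw)))
```

The table is deliberately small; any code it does not list is reported as "unknown" rather than guessed.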

Steps to reproduce:
Comment 1 Adrian Bunk 2006-05-09 16:01:34 UTC
This Bugzilla only covers bugs in kernels that are both:
- unmodified ftp.kernel.org kernels and
- recent kernels (IOW: >= 2.6.16).

Please either reproduce with an unmodified 2.6.16 kernel from ftp.kernel.org or
report this issue to the vendor of your kernel.
Comment 2 Jim Suhr 2006-05-15 05:15:42 UTC
Installed kernel 2.6.16.15 from kernel.org. The machine check interrupt now
causes the following message in /var/log/messages. The system then goes into a
recursive failure loop. Additional messages appear on the HMC but do not appear
to get logged. I am trying to find a way other than pen and paper to capture
the HMC messages.
Jim Suhr
May 14 20:23:11 localhost kernel: BUG: soft lockup detected on CPU#0!
May 14 20:23:11 localhost kernel: CPU:    0    Not tainted
May 14 20:23:11 localhost kernel: Process kmcheck (pid: 10436, task: 000000000f8db908, ksp: 000000000d8d7d20)
Comment 3 Jim Suhr 2006-05-24 05:33:31 UTC
Additional documentation: the following hang happened when a CHPID was
configured online on the MVS LPARs.
May 23 14:19:56 localhost kernel: crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=B, anc=0, erc=0, rsid=0
May 23 14:19:56 localhost kernel: BUG: soft lockup detected on CPU#0!
May 23 14:19:56 localhost kernel: BUG: soft lockup detected on CPU#0!
May 23 14:19:56 localhost kernel: CPU:    0    Not tainted
May 23 14:19:56 localhost kernel: CPU:    0    Not tainted
May 23 14:19:56 localhost kernel: Process kmcheck (pid: 10510, task: 000000000f906048, ksp: 000000000df0fd20)
May 23 14:19:56 localhost kernel: Process kmcheck (pid: 10510, task: 000000000f906048, ksp: 000000000df0fd20)
May 23 14:19:56 localhost kernel: Krnl PSW : 0704000180000000 00000000002056d2 (_spin_unlock+0x1a/0x28)
May 23 14:19:56 localhost kernel: Krnl PSW : 0704000180000000 00000000002056d2 (_spin_unlock+0x1a/0x28)
May 23 14:19:56 localhost kernel: Krnl GPRS: 00000000fffffff9 ffffffff00000000 0000000000205447 0000000000231460
May 23 14:19:56 localhost kernel: Krnl GPRS: 00000000fffffff9 ffffffff00000000 0000000000205447 0000000000231460
May 23 14:19:56 localhost kernel:            00000000002056d2 0000000000000020 000000000000000b 0000000000000000
May 23 14:19:56 localhost kernel:            00000000002056d2 0000000000000020 000000000000000b 0000000000000000
May 23 14:19:56 localhost kernel:            000000000df0fe48 0000000000374078 000000000f5e4d68 000000000df0fd28
May 23 14:19:56 localhost kernel:            000000000df0fe48 0000000000374078 000000000f5e4d68 000000000df0fd28
May 23 14:19:56 localhost kernel:            000000000f5e4d70 000000000021c9b8 00000000002056d2 000000000df0fc10
May 23 14:19:56 localhost kernel:            000000000f5e4d70 000000000021c9b8 00000000002056d2 000000000df0fc10
May 23 14:19:56 localhost kernel: Krnl Code: e3 40 f0 a8 00 04 eb ef f0 a8 00 04 07 f4 eb ef f0 88 00 24
May 23 14:19:56 localhost kernel: Krnl Code: e3 40 f0 a8 00 04 eb ef f0 a8 00 04 07 f4 eb ef f0 88 00 24
May 23 14:19:56 localhost kernel: Call Trace:
May 23 14:19:56 localhost kernel: Call Trace:
May 23 14:19:56 localhost kernel: ([<00000000002056d2>] _spin_unlock+0x1a/0x28)
May 23 14:19:56 localhost kernel: ([<00000000002056d2>] _spin_unlock+0x1a/0x28)
May 23 14:19:56 localhost kernel:  [<000000000020244c>] klist_next+0x84/0x98
May 23 14:19:56 localhost kernel:  [<000000000020244c>] klist_next+0x84/0x98
May 23 14:19:56 localhost kernel:  [<000000000014cd82>] next_device+0x1a/0x40
May 23 14:19:56 localhost kernel:  [<000000000014cd82>] next_device+0x1a/0x40
May 23 14:19:58 localhost kernel:  [<000000000014cea4>] bus_find_device+0x70/0x98
May 23 14:19:58 localhost kernel:  [<000000000014cea4>] bus_find_device+0x70/0x98
May 23 14:20:42 localhost kernel:  [<000000000016584a>] get_subchannel_by_schid+0x32/0x58
May 23 14:20:42 localhost kernel:  [<000000000016584a>] get_subchannel_by_schid+0x32/0x58
May 23 14:20:42 localhost kernel:  [<00000000001623aa>] __s390_process_res_acc+0x26/0x1d4
May 23 14:20:42 localhost kernel:  [<00000000001623aa>] __s390_process_res_acc+0x26/0x1d4
May 23 14:20:43 localhost kernel:  [<0000000000165588>] for_each_subchannel+0x44/0xac
May 23 14:20:43 localhost kernel:  [<0000000000165588>] for_each_subchannel+0x44/0xac
May 23 14:20:43 localhost kernel:  [<0000000000163764>] chsc_process_crw+0x4e0/0x544
May 23 14:20:43 localhost kernel:  [<0000000000163764>] chsc_process_crw+0x4e0/0x544
Comment 4 Peter Oberparleiter 2006-08-24 01:11:24 UTC
Problem:

The Linux system had access to ~5000 non-Linux I/O devices. When a machine check
occurred, the recovery procedure for these devices delayed timer interrupt
processing for a prolonged time, causing the soft-lockup mechanism to trigger.

The recovery procedure also caused hotplug events for each checked device. This
resulted in a massive number of hotplug agent processes being started, which
eventually triggered the out-of-memory killer and brought normal operations to a
halt.

Solution:

Disable the soft-lockup mechanism on s390 systems. Also minimize the number of
I/O devices visible to Linux by using the cio_ignore kernel parameter.

This parameter is documented on page 333 of the "Drivers, Features, and Commands
- SC33-8289-01" document which can be found at:
 
http://download.boulder.ibm.com/ibmdl/pub/software/dw/linux390/docu/l26cdd01.pdf
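
As a rough illustration of the cio_ignore approach, the sketch below builds a parameter value that ignores every device except a short list of ranges, using the "all" keyword plus "!" exclusions and "-" ranges. The helper names and the device numbers (0150-015f and 0009) are hypothetical and stand in for whatever devices the Linux LPAR actually needs; the exact syntax should be checked against the referenced document for the kernel in use.

```python
def device_id(devno, cssid=0, ssid=0):
    """Format a device number as a bus ID such as 0.0.0150."""
    if not 0 <= devno <= 0xFFFF:
        raise ValueError("device number out of range: %r" % (devno,))
    return "%x.%x.%04x" % (cssid, ssid, devno)


def cio_ignore_keep_only(ranges):
    """Build a cio_ignore= value that ignores all devices except `ranges`.

    `ranges` is an iterable of (first, last) device numbers, inclusive.
    A range with first == last is emitted as a single device.
    """
    parts = ["all"]
    for first, last in ranges:
        if first > last:
            raise ValueError("invalid range: %04x-%04x" % (first, last))
        if first == last:
            parts.append("!" + device_id(first))
        else:
            parts.append("!%s-%s" % (device_id(first), device_id(last)))
    return "cio_ignore=" + ",".join(parts)


if __name__ == "__main__":
    # Hypothetical example: keep a block of 16 devices and one console.
    print(cio_ignore_keep_only([(0x0150, 0x015F), (0x0009, 0x0009)]))
```

The resulting string is meant to be appended to the kernel parameter line. Starting from "all" and listing only the devices to keep suits a case like this one, where the devices Linux needs are far fewer than the ~5000 it can see.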
