Most recent kernel where this bug did not occur: unknown
Distribution: CentOS 4.2 S390x
Hardware Environment: IBM Z900 model 104
Software Environment: Linux running in LPAR mode (no VM)
Problem Description: System hangs with the following message on the hardware console:
05:32:36 localhost kernel: crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=B, anc=0, erc=0, rsid=
Kernel level is:
localhost kernel: Linux version 2.6.9-34.EL (builder@c4s390x) (gcc version 3.4.5 20051201 (Red Hat 3.4.5-2)) 0000001 SMP
The rsc=B indicates that the interrupt is from the channel subsystem. RMF indicates the LPAR is using 100% CPU. There are no indications of a hardware error on any of the MVS LPARs.
Steps to reproduce:
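For anyone reading these console messages, the fields of the crw_info line can be pulled apart mechanically. The following is a minimal sketch, not kernel code: the function names are made up for illustration, the values are assumed to be printed in hexadecimal, and of the reporting-source codes in the table only rsc=B (channel subsystem) is confirmed by this report; the other entries are assumptions.

```python
import re

# Reporting-source codes (rsc). Only 0xB is confirmed by this bug report;
# the remaining entries are assumptions added for illustration.
RSC_NAMES = {
    0x2: "monitoring facility",
    0x3: "subchannel",
    0x4: "channel path",
    0x9: "configuration-alert facility",
    0xB: "channel subsystem",
}

# Matches "name=value" pairs; the value may be empty (as in a truncated line).
FIELD_RE = re.compile(r"(\w+)=([0-9A-Fa-f]*)")


def parse_crw(line):
    """Return a dict of CRW fields from a 'CRW reports ...' log line.

    Values are parsed as hexadecimal (an assumption); a field with no
    value maps to None. A line without 'CRW reports' yields an empty dict.
    """
    _, _, tail = line.partition("CRW reports")
    fields = {}
    for name, value in FIELD_RE.findall(tail):
        fields[name] = int(value, 16) if value else None
    return fields


def describe_rsc(fields):
    """Map the rsc field to a readable name, or 'unknown source'."""
    return RSC_NAMES.get(fields.get("rsc"), "unknown source")


if __name__ == "__main__":
    line = ("05:32:36 localhost kernel: crw_info : CRW reports slct=0, "
            "oflw=0, chn=0, rsc=B, anc=0, erc=0, rsid=")
    crw = parse_crw(line)
    print(crw["rsc"], describe_rsc(crw))
```

Applied to the message above, the rsc field parses to 11 (0xB), which the table maps to the channel subsystem; the truncated rsid field is left as None rather than guessed.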
This Bugzilla only covers bugs in kernels that are both: - unmodified ftp.kernel.org kernels and - recent kernels (IOW: >= 2.6.16). Please either reproduce with an unmodified 2.6.16 kernel from ftp.kernel.org or report this issue to the vendor of your kernel.
Installed kernel 2.6.16.15 from kernel.org. The machine check interrupt now causes the following message in /var/log/messages. The system then goes into a recursive failure loop. Additional messages appear on the HMC but do not appear to get logged. I am trying to find a way other than pen and paper to capture the HMC messages.
Jim Suhr
May 14 20:23:11 localhost kernel: BUG: soft lockup detected on CPU#0!
May 14 20:23:11 localhost kernel: CPU: 0 Not tainted
May 14 20:23:11 localhost kernel: Process kmcheck (pid: 10436, task: 000000000f8db908, ksp: 000000000d8d7d20)
Additional documentation: the following hang happened when a CHPID was configured online on the MVS LPARs.
May 23 14:19:56 localhost kernel: crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=B, anc=0, erc=0, rsid=0
May 23 14:19:56 localhost kernel: BUG: soft lockup detected on CPU#0!
May 23 14:19:56 localhost kernel: BUG: soft lockup detected on CPU#0!
May 23 14:19:56 localhost kernel: CPU: 0 Not tainted
May 23 14:19:56 localhost kernel: CPU: 0 Not tainted
May 23 14:19:56 localhost kernel: Process kmcheck (pid: 10510, task: 000000000f906048, ksp: 000000000df0fd20)
May 23 14:19:56 localhost kernel: Process kmcheck (pid: 10510, task: 000000000f906048, ksp: 000000000df0fd20)
May 23 14:19:56 localhost kernel: Krnl PSW : 0704000180000000 00000000002056d2 (_spin_unlock+0x1a/0x28)
May 23 14:19:56 localhost kernel: Krnl PSW : 0704000180000000 00000000002056d2 (_spin_unlock+0x1a/0x28)
May 23 14:19:56 localhost kernel: Krnl GPRS: 00000000fffffff9 ffffffff00000000 0000000000205447 0000000000231460
May 23 14:19:56 localhost kernel: Krnl GPRS: 00000000fffffff9 ffffffff00000000 0000000000205447 0000000000231460
May 23 14:19:56 localhost kernel: 00000000002056d2 0000000000000020 000000000000000b 0000000000000000
May 23 14:19:56 localhost kernel: 00000000002056d2 0000000000000020 000000000000000b 0000000000000000
May 23 14:19:56 localhost kernel: 000000000df0fe48 0000000000374078 000000000f5e4d68 000000000df0fd28
May 23 14:19:56 localhost kernel: 000000000df0fe48 0000000000374078 000000000f5e4d68 000000000df0fd28
May 23 14:19:56 localhost kernel: 000000000f5e4d70 000000000021c9b8 00000000002056d2 000000000df0fc10
May 23 14:19:56 localhost kernel: 000000000f5e4d70 000000000021c9b8 00000000002056d2 000000000df0fc10
May 23 14:19:56 localhost kernel: Krnl Code: e3 40 f0 a8 00 04 eb ef f0 a8 00 04 07 f4 eb ef f0 88 00 24
May 23 14:19:56 localhost kernel: Krnl Code: e3 40 f0 a8 00 04 eb ef f0 a8 00 04 07 f4 eb ef f0 88 00 24
May 23 14:19:56 localhost kernel: Call Trace:
May 23 14:19:56 localhost kernel: Call Trace:
May 23 14:19:56 localhost kernel: ([<00000000002056d2>] _spin_unlock+0x1a/0x28)
May 23 14:19:56 localhost kernel: ([<00000000002056d2>] _spin_unlock+0x1a/0x28)
May 23 14:19:56 localhost kernel: [<000000000020244c>] klist_next+0x84/0x98
May 23 14:19:56 localhost kernel: [<000000000020244c>] klist_next+0x84/0x98
May 23 14:19:56 localhost kernel: [<000000000014cd82>] next_device+0x1a/0x40
May 23 14:19:56 localhost kernel: [<000000000014cd82>] next_device+0x1a/0x40
May 23 14:19:58 localhost kernel: [<000000000014cea4>] bus_find_device+0x70/0x98
May 23 14:19:58 localhost kernel: [<000000000014cea4>] bus_find_device+0x70/0x98
May 23 14:20:42 localhost kernel: [<000000000016584a>] get_subchannel_by_schid+0x32/0x58
May 23 14:20:42 localhost kernel: [<000000000016584a>] get_subchannel_by_schid+0x32/0x58
May 23 14:20:42 localhost kernel: [<00000000001623aa>] __s390_process_res_acc+0x26/0x1d4
May 23 14:20:42 localhost kernel: [<00000000001623aa>] __s390_process_res_acc+0x26/0x1d4
May 23 14:20:43 localhost kernel: [<0000000000165588>] for_each_subchannel+0x44/0xac
May 23 14:20:43 localhost kernel: [<0000000000165588>] for_each_subchannel+0x44/0xac
May 23 14:20:43 localhost kernel: [<0000000000163764>] chsc_process_crw+0x4e0/0x544
May 23 14:20:43 localhost kernel: [<0000000000163764>] chsc_process_crw+0x4e0/0x544
Problem: The Linux system had access to ~5000 non-Linux I/O devices. When a machine check occurred, the recovery procedure for these devices delayed timer interrupt processing for a prolonged time, causing the soft-lockup mechanism to trigger. The recovery procedure also caused hotplug events for each checked device. This resulted in a massive number of hotplug agent processes being started, which eventually triggered the out-of-memory killer and brought normal operations to a halt.
Solution: Disable the soft-lockup mechanism on s390 systems. Also minimize the number of I/O devices visible to Linux by using the cio_ignore kernel parameter. This parameter is documented on page 333 of the "Drivers, Features, and Commands - SC33-8289-01" document, which can be found at: http://download.boulder.ibm.com/ibmdl/pub/software/dw/linux390/docu/l26cdd01.pdf
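To make the cio_ignore suggestion concrete, here is a rough sketch of how such a parameter string could be put together: ignore everything, then exempt only the devices Linux actually uses. The helper names and the device numbers in the usage note are hypothetical, and the 0.0.xxxx bus-ID form is assumed; the IBM document referenced above remains the authority on the exact syntax supported by a given kernel.

```python
def bus_id(devno):
    """Format a device number as a bus ID, e.g. 0x150 -> '0.0.0150'.

    Assumes channel subsystem 0 and subchannel set 0.
    """
    return "0.0.%04x" % devno


def cio_ignore_param(keep):
    """Build a 'cio_ignore=all,!...' kernel parameter string.

    'keep' is a list of device numbers or (first, last) tuples for the
    devices that should stay visible to Linux; everything else is ignored.
    """
    parts = ["all"]
    for item in keep:
        if isinstance(item, tuple):
            first, last = item
            parts.append("!%s-%s" % (bus_id(first), bus_id(last)))
        else:
            parts.append("!" + bus_id(item))
    return "cio_ignore=" + ",".join(parts)


if __name__ == "__main__":
    # Hypothetical device numbers: one range of 16 devices plus one console.
    print(cio_ignore_param([(0x0150, 0x015F), 0x0009]))
```

With the hypothetical devices above this yields cio_ignore=all,!0.0.0150-0.0.015f,!0.0.0009, which would go on the kernel command line. Starting from "all" and exempting a short list keeps the parameter small even when thousands of foreign devices are attached.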