Date: Mon, 21 Jan 2008 09:40:59 +0800 From: william cheng <opensourcewill@gmail.com> To: linux-kernel@vger.kernel.org Subject: modprobing ipmi_si on Dell Power Edge 2600 make the terminal hang kernel ver. >= 2.6.20 Dear all, We got some problem on modprobing the ipmi_si module on Dell Power Edge 2600. On modprobing the ipmi_si the terminal hang and the process cannot be terminated by control-C. We got these messages in dmesg ipmi message handler version 39.1 IPMI System Interface driver. ipmi_si: Trying SMBIOS-specified bt state machine at i/o address 0xe4, slave address 0x20, irq 10 Using irq 10 IPMI BT: req2rsp=10 secs retries=3 ipmi_si: Error clearing flags: c6 We have tested a few other dell machine (Power Edge 2500, 2900, ....), and this problem only occurs in Power Edge 2600. The problem occurs in the following testes Debian testing lenny Ubuntu 7.10 Ubuntu 7.04 Fedore 8 Kernel 2.6.20 Kernel 2.6.21 Kernel 2.6.22 Kernel 2.6.22.5 Kernel 2.6.22.14 Kernel 2.6.22.15 Kernel 2.6.23.12 While in the following testes we can successfully modprobe the ipmi_si Ubuntu 6.10 Debian 4.0r2 Kernel 2.6.18.3 Kernel 2.6.18.5 Kernel 2.6.18.8 Kernel 2.6.19 Kernel 2.6.19.7 It looks like the problem only occurs when using kernel after 2.6.20 with Power Edge 2600. Anyone know the reason for that? Any additional information should I provide? Thanks William Latest working kernel version: 2.6.19 Earliest failing kernel version: 2.6.20 Distribution: Debian testing lenny Ubuntu 7.10 Ubuntu 7.04 Fedore 8 Kernel 2.6.20 Kernel 2.6.21 Kernel 2.6.22 Kernel 2.6.22.5 Kernel 2.6.22.14 Kernel 2.6.22.15 Kernel 2.6.23.12 Hardware Environment: Dell PowerEdge 2600 Software Environment: Problem Description: see above Steps to reproduce: modprobe ipmi_si
The error C6 is from the BMC and it is "Request data truncated", whatever that means. However, the driver should still work and not hang; I don't think that's the root of the problem. How long did you wait? It might take a few minutes to fail. Start it and go to lunch and see if it returns before you are back :). Is anything happening with IRQ 10? The only change I see that might cause a problem is commit id 4d7cbac7c870ca66d8fb27d68188efbb5de2dffa, "IPMI: Fix BT long busy". It might be worth reverting that patch and seeing what happens.
Not sure how it is supposed to be related to hardware monitoring?
Hi Corey Minyard, (In reply to comment #1) > How long did you wait? It might take a few minutes to fail. Start it and go > to lunch and see if it returns before you are back :). > Just do an additional experiment. Using ubuntu 7.10 live CD Run the command #date > test.txt; modprobe ipmi_si; date >> test.txt And have a big lunch :P ... after 4.5 hours it still don't return. > Is anything happening with IRQ 10? I think IRQ 10 is not used in that machine thanks William
As pointed out in Comment#1, we modified the patch ("IPMI: Fix BT long busy") by removing the changes in ipmi_bt_sm.c file and applied the rest of the patch as is. After these modifications, ipmi driver loads successfully without any issues. Thanks Srinivas.
bugme-daemon@bugzilla.kernel.org wrote: > Just do an additional experiment. Using ubuntu 7.10 live CD > Run the command > #date > test.txt; modprobe ipmi_si; date >> test.txt > And have a big lunch :P ... after 4.5 hours it still don't return. > Ok, that means it's hung :). > >> Is anything happening with IRQ 10? >> > I think IRQ 10 is not used in that machine > Uhmm... ipmi_si: Trying SMBIOS-specified bt state machine at i/o address 0xe4, slave address 0x20, irq 10 Using irq 10 It seems like the IPMI driver should be using IRQ 10.
bugme-daemon@bugzilla.kernel.org wrote: > ------- Comment #4 from wwlinuxengineering@dell.com 2008-01-23 03:23 ------- > As pointed out in Comment#1, we modified the patch ("IPMI: Fix BT long busy") > by removing the changes in ipmi_bt_sm.c file and applied the rest of the > patch > as is. > After these modifications, ipmi driver loads successfully without any issues. > That is actually not pointed out in comment #1 (probably was in the email thread before I saw it), but it is good information. I guess it's time to do some debugging there. Can you load the module and add "bt_debug=7" to the modprobe/insmod line? It may be time to pull Rocky Craig into the conversation, too.
Created attachment 14565 [details] /var/log/messages containing logs when ipmi_si driver was loaded with bt_debug.
There's several weird things going on (inferring from the log attachment): 1. The upper layer is starting the kernel thread to do polling ("Read event message buffer") even though the hardware looks like it supports interrupts. 2. "Using interrupts" messages are coming out at unexpected times. 3. Something is requesting "Get Device GUID" early in the initialization. I don't see it in ipmi_si_intf.c, could it be coming from the message handler? And why is it happening so soon? 4. The Device GUID response is getting mangled somehow, maybe there's some race conditions with interrupt enabling? Anyhow, the "main" thread doesn't seem to be returning from initialization, even though the polling thread is merrily chugging away. I don't actually have a PE 2600 so I'm not sure where to go next.
Adding this as an attachment generated CGI errors, so here is an abbreviated, mildly annotated version of the debug output. I'm wondering if the hard failure of the "Get Event Message Buffer" is what's blocking the completion of the initialization phase, and thus hanging the insmod? 38:19 ipmi message handler version 39.1 38:19 ipmi device interface 38:50 IPMI System Interface driver. 38:50 ipmi_si: Trying SMBIOS-specified bt state machine at to address 0xe4, slave address 0x20, irq 10 38:50 IPMI BT: flag reset [ SMS ] BT subdriver entry 38:50 BT: +++++++++++++++++ New command ipmi_si layer init 38:50 BT: NetFn/LUN CMD [0 data]: 18 01 Get Device ID 38:50 BT: XACTION [ ] TO=5000000 - 0 38:50 BT: WR_BYTES [ ] TO=5000000 - 0 38:50 BT: write 4 bytes seq=0x01 03 18 00 01 seq=xxxx is one-off 38:50 BT: WR_CONSUME [ H2B ] TO=5000000 - 0 38:50 BT: WR_CONSUME [ ] TO=5000000 - 100 38:50 BT: RD_WAIT [ ] TO=4999900 - 0 38:50 BT: RD_WAIT [ ] TO=4999900 - 100 38:50 BT: RD_WAIT [ ] TO=4999800 - 100 38:50 BT: RD_WAIT [ ] TO=4999700 - 100 38:50 BT: RD_WAIT [ ] TO=4999600 - 100 38:50 BT: RD_WAIT [ ] TO=4999500 - 100 38:50 BT: RD_WAIT [ ] TO=4999400 - 100 38:50 BT: RD_WAIT [ ] TO=4999300 - 100 38:50 BT: RD_WAIT [ ] TO=4999200 - 100 38:50 BT: RD_WAIT [ B2H ] TO=4999100 - 100 38:50 BT: CLEAR_B2H [ H_BUSY ] TO=4999000 - 0 38:50 BT: RD_BYTES [ H_BUSY ] TO=4999000 - 0 38:50 BT: got 16 bytes seq=0x00 0f 1c 00 01 00 00 80 01 seq=xxxx is correct 84 01 1f a2 02 00 04 00 38:50 BT: result 14 bytes: 1c 01 00 00 80 01 84 01 1f a2 02 00 04 00 38:50 Using irq 10 odd placement... 38:50 BT: IDLE [ ] TO=4999000 - 0 38:50 BT: CAP_BEGIN [ ] TO=4999000 - 0 Started from first IDLE state 38:50 BT: +++++++++++++++++ New command Get BT Interface Capabilities 38:50 BT: NetFn/LUN CMD [0 data]: 18 36 38:50 BT: XACTION [ ] TO=5000000 - 10000 38:50 BT: WR_BYTES [ ] TO=4990000 - 0 38:50 BT: write 4 bytes seq=0x02 03 18 01 36 38:51 BT: WR_CONSUME [ H2B ] TO=4990000 - 0 38:51 BT: WR_CONSUME [ ] TO=4990000 - 11000 38:51 BT: RD_WAIT [ ] TO=4979000 - 0 38:51 BT: RD_WAIT [ B2H ] TO=4979000 - 11000 38:51 BT: CLEAR_B2H [ H_BUSY ] TO=4968000 - 0 38:51 BT: RD_BYTES [ H_BUSY ] TO=4968000 - 0 38:51 BT: got 10 bytes seq=0x01 09 1c 01 36 00 01 40 40 0a 03 38:51 BT: CAP_END [ ] TO=4968000 - 0 38:51 BT: result 8 bytes: 1c 36 00 01 40 40 0a 03 38:51 IPMI BT: req2rsp=10 secs retries=3 IO buffers = 0x40 bytes 38:51 BT: IDLE [ ] TO=10000000 - 0 38:51 BT: +++++++++++++++++ New command WHO DID THIS? 38:51 BT: NetFn/LUN CMD [0 data]: 18 08 Get Device GUID 38:51 BT: XACTION [ ] TO=10000000 - 0 38:51 BT: WR_BYTES [ ] TO=10000000 - 0 38:51 BT: write 4 bytes seq=0x01 03 18 00 08 38:51 BT: WR_CONSUME [ H2B ] TO=10000000 - 0 38:51 BT: WR_CONSUME [ ] TO=10000000 - 11000 38:51 BT: RD_WAIT [ ] TO=9989000 - 0 38:51 BT: RD_WAIT [ B2H ] TO=9989000 - 11000 38:51 BT: CLEAR_B2H [ H_BUSY ] TO=9978000 - 0 38:51 BT: RD_BYTES [ H_BUSY ] TO=9978000 - 0 38:51 BT: got 21 bytes seq=0x00 14 1c 00 08 00 00 00 00 00 00 00 00 00 00 00 00 ... cCode == 0, looks like like 16 bytes of zeros 38:51 BT: result 3 bytes: 1c 08 c6 ERROR/CORRUPTION COMES IN HERE??? 38:51 ipmi_si: Error clearing flags: c6 I've seen this kind of thing before, the c6 is leftover from some other operation when interrupts were not timed right...can't quite remember it. 38:51 BT: +++++++++++++++++ New command ipmi_si layer init 38:51 BT: NetFn/LUN CMD [0 data]: 18 2f Get BMC Global Enables 38:51 BT: XACTION [ ] TO=10000000 - 0 38:51 BT: WR_BYTES [ ] TO=10000000 - 11000 38:51 BT: write 4 bytes seq=0x02 03 18 01 2f 38:51 BT: WR_CONSUME [ H2B ] TO=9989000 - 0 38:51 BT: WR_CONSUME [ B2H ] TO=9989000 - 11000 38:51 BT: RD_WAIT [ B2H ] TO=9978000 - 0 38:51 BT: CLEAR_B2H [ H_BUSY ] TO=9978000 - 0 38:51 BT: RD_BYTES [ H_BUSY ] TO=9978000 - 0 38:51 BT: got 6 bytes seq=0x01 05 1c 01 2f 00 09 38:51 BT: result 4 bytes: 1c 2f 00 09 38:51 BT: +++++++++++++++++ New command ipmi_si layer init 38:51 BT: NetFn/LUN CMD [1 data]: 18 2e 09 Set BMC Global Enables 38:51 BT: XACTION [ ] TO=10000000 - 0 38:51 BT: WR_BYTES [ ] TO=10000000 - 11000 38:52 BT: write 5 bytes seq=0x03 04 18 02 2e 09 38:52 BT: WR_CONSUME [ H2B ] TO=9989000 - 0 38:52 BT: WR_CONSUME [ B2H ] TO=9989000 - 11000 38:52 BT: RD_WAIT [ B2H ] TO=9978000 - 0 38:52 BT: CLEAR_B2H [ H_BUSY ] TO=9978000 - 0 38:52 BT: RD_BYTES [ H_BUSY ] TO=9978000 - 0 38:52 BT: got 5 bytes seq=0x02 04 1c 02 2e 00 38:52 BT: result 3 bytes: 1c 2e 00 38:52 BT: IDLE [ ] TO=9978000 - 0 38:52 BT: IDLE [ ] TO=10000000 - 11000 38:52 last message repeated 28 times 38:52 BT: IDLE [ ] TO=10000000 - 10000 38:52 BT: IDLE [ ] TO=10000000 - 11000 38:52 last message repeated 34 times 38:52 BT: +++++++++++++++++ New command ipmi_si layer polling thread? 38:52 BT: NetFn/LUN CMD [0 data]: 18 35 Read Event Message buffer 38:52 BT: XACTION [ ] TO=10000000 - 0 38:52 BT: WR_BYTES [ ] TO=10000000 - 0 38:52 BT: write 4 bytes seq=0x04 03 18 03 35 38:52 BT: WR_CONSUME [ H2B ] TO=10000000 - 0 38:52 BT: WR_CONSUME [ B2H ] TO=10000000 - 11000 38:52 BT: RD_WAIT [ B2H ] TO=9989000 - 0 38:52 BT: CLEAR_B2H [ H_BUSY ] TO=9989000 - 0 38:52 BT: RD_BYTES [ H_BUSY ] TO=9989000 - 0 38:52 BT: got 5 bytes seq=0x03 04 1c 03 35 ce cCode == CE: Response could 38:52 BT: result 3 bytes: 1c 35 ce not be provided 38:52 BT: IDLE [ ] TO=9989000 - 0 38:52 BT: IDLE [ ] TO=10000000 - 11000 38:52 last message repeated 53 times 38:52 BT: IDLE [ ] TO=10000000 - 12000 38:52 BT: IDLE [ ] TO=10000000 - 10000 38:52 BT: IDLE [ ] TO=10000000 - 11000 38:53 last message repeated 33 times 38:53 BT: +++++++++++++++++ New command Ad infinitum 38:53 BT: NetFn/LUN CMD [0 data]: 18 35 Read Event Message buffer 38:53 BT: XACTION [ ] TO=10000000 - 0 38:53 BT: WR_BYTES [ ] TO=10000000 - 0 38:53 BT: write 4 bytes seq=0x05 03 18 04 35 38:53 BT: WR_CONSUME [ H2B ] TO=10000000 - 0 38:53 BT: WR_CONSUME [ B2H ] TO=10000000 - 11000 38:53 BT: RD_WAIT [ B2H ] TO=9989000 - 0 38:53 BT: CLEAR_B2H [ H_BUSY ] TO=9989000 - 0 38:53 BT: RD_BYTES [ H_BUSY ] TO=9989000 - 0 38:53 BT: got 5 bytes seq=0x04 04 1c 04 35 ce 38:53 BT: result 3 bytes: 1c 35 ce : : :
If I may, I'd like to add a strange finding that may help in finding the cause of this issue or at least may be slightly related. I found that the ipmi module loaded fine in CentOS 5 with kernel 2.6.18-53 with a 2 year old RAID firmware. After we upgraded the firmware to the latest Critical update for the PE2600 in question the ipmi module hangs in a continuous loop. I understand that this bug was reported for kernels >=2.6.20 but this is the second PE2600 we've had this problem on although we thought we were good to go until we upgraded the RAID firmware for the PERC 4/Di.
ipmi message handler version 39.1 IPMI System Interface driver. ipmi_si: Trying SMBIOS-specified bt state machine at i/o address 0xe4, slave address 0x20, irq 15 Using irq 15 IPMI BT: req2rsp=10 secs retries=3 ipmi_si: Error clearing flags: c6 ipmi device interface CentOS 5 2.6.18-53.1.6.el5 #1 SMP Wed Jan 23 11:30:20 EST 2008 i686 i686 i386 GNU/Linux My PE6650 server got same problem hangs forever.
The following patch fixes the issue: diff -purN base/ipmi_si_intf.c rhel51/ipmi_si_intf.c --- base/ipmi_si_intf.c 2008-04-01 14:08:34.000000000 -0400 +++ rhel51/ipmi_si_intf.c 2008-04-01 14:00:46.000000000 -0400 @@ -947,10 +947,6 @@ static int smi_start_processing(void new_smi->intf = intf; - /* Try to claim any interrupts. */ - if (new_smi->irq_setup) - new_smi->irq_setup(new_smi); - /* Set up the timer that drives the interface. */ setup_timer(&new_smi->si_timer, smi_timeout, (long)new_smi); new_smi->last_timeout_jiffies = jiffies; @@ -2620,6 +2616,10 @@ static int try_smi_init(struct smi_info setup_oem_data_handler(new_smi); setup_xaction_handlers(new_smi); + /* Try to claim any interrupts. */ + if (new_smi->irq_setup) + new_smi->irq_setup(new_smi); + INIT_LIST_HEAD(&(new_smi->xmit_msgs)); INIT_LIST_HEAD(&(new_smi->hp_xmit_msgs)); new_smi->curr_msg = NULL; This patch is available as a part of 2.6.24 kernel upsteam and has been backported to RHEL 5.2(kernel 2.6.18-92).