Bug 9793 - PowerEdge 2600 modprobe ipmi_si hangs
Summary: PowerEdge 2600 modprobe ipmi_si hangs
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Hardware Monitoring (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Dell Linux Engineering
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-01-22 06:58 UTC by Matt Domsch
Modified: 2008-09-22 10:36 UTC (History)
5 users (show)

See Also:
Kernel Version: 2.6.20 - 2.6.23.12
Tree: Mainline
Regression: Yes


Attachments
/var/log/messages containing logs when ipmi_si driver was loaded with bt_debug. (203.08 KB, application/msword)
2008-01-24 09:40 UTC, Dell Linux Engineering
Details

Description Matt Domsch 2008-01-22 06:58:59 UTC
Date:   Mon, 21 Jan 2008 09:40:59 +0800
From: william cheng <opensourcewill@gmail.com>
To: linux-kernel@vger.kernel.org
Subject: modprobing ipmi_si on Dell Power Edge 2600 make the terminal hang kernel ver. >= 2.6.20

Dear all,

        We got some problem on modprobing the ipmi_si module on Dell
Power Edge 2600.
        On modprobing the ipmi_si the terminal hang and the process
cannot be terminated by control-C.
        We got these messages in dmesg

        ipmi message handler version 39.1
        IPMI System Interface driver.
        ipmi_si: Trying SMBIOS-specified bt state machine at i/o
address 0xe4, slave address 0x20, irq 10
          Using irq 10
        IPMI BT: req2rsp=10 secs retries=3
        ipmi_si: Error clearing flags: c6


        We have tested a few other dell machine (Power Edge 2500,
2900, ....), and this problem only occurs in Power Edge 2600.


        The problem occurs in the following testes
        Debian testing lenny
        Ubuntu 7.10
        Ubuntu 7.04
        Fedore 8
        Kernel 2.6.20
        Kernel 2.6.21
        Kernel 2.6.22
        Kernel 2.6.22.5
        Kernel 2.6.22.14
        Kernel 2.6.22.15
        Kernel 2.6.23.12

        While in the following testes we can successfully modprobe the ipmi_si
        Ubuntu 6.10
        Debian 4.0r2
        Kernel 2.6.18.3
        Kernel 2.6.18.5
        Kernel 2.6.18.8
        Kernel 2.6.19
        Kernel 2.6.19.7

        It looks like the problem only occurs when using kernel after
2.6.20 with Power Edge 2600.
        Anyone know the reason for that?
        Any additional information should I provide?
        Thanks
William


Latest working kernel version:
2.6.19

Earliest failing kernel version:
2.6.20

Distribution:
        Debian testing lenny
        Ubuntu 7.10
        Ubuntu 7.04
        Fedore 8
        Kernel 2.6.20
        Kernel 2.6.21
        Kernel 2.6.22
        Kernel 2.6.22.5
        Kernel 2.6.22.14
        Kernel 2.6.22.15
        Kernel 2.6.23.12


Hardware Environment:
Dell PowerEdge 2600

Software Environment:
Problem Description:
see above

Steps to reproduce:
modprobe ipmi_si
Comment 1 Corey Minyard 2008-01-22 07:32:46 UTC
The error C6 is from the BMC and it is "Request data truncated", whatever that means.  However, the driver should still work and not hang; I don't think that's the root of the problem.

How long did you wait?  It might take a few minutes to fail.  Start it and go to lunch and see if it returns before you are back :).

Is anything happening with IRQ 10?

The only change I see that might cause a problem is commit id 4d7cbac7c870ca66d8fb27d68188efbb5de2dffa, "IPMI: Fix BT long busy".  It might be worth reverting that patch and seeing what happens.
Comment 2 Jean Delvare 2008-01-22 09:52:24 UTC
Not sure how it is supposed to be related to hardware monitoring?
Comment 3 william 2008-01-23 00:28:20 UTC
Hi Corey Minyard,
(In reply to comment #1)
> How long did you wait?  It might take a few minutes to fail.  Start it and go
> to lunch and see if it returns before you are back :).
> 
Just do an additional experiment. Using ubuntu 7.10 live CD
Run the command
#date > test.txt; modprobe ipmi_si; date >> test.txt
And have a big lunch :P ... after 4.5 hours it still don't return.

> Is anything happening with IRQ 10?
I think IRQ 10 is not used in that machine

thanks
William
Comment 4 Dell Linux Engineering 2008-01-23 03:23:08 UTC
As pointed out in Comment#1, we modified the patch ("IPMI: Fix BT long busy") by removing the changes in ipmi_bt_sm.c file and applied the rest of the patch as is.
After these modifications, ipmi driver loads successfully without any issues.

Thanks
Srinivas.
Comment 5 Corey Minyard 2008-01-23 06:13:59 UTC
bugme-daemon@bugzilla.kernel.org wrote:
> Just do an additional experiment. Using ubuntu 7.10 live CD
> Run the command
> #date > test.txt; modprobe ipmi_si; date >> test.txt
> And have a big lunch :P ... after 4.5 hours it still don't return.
>   
Ok, that means it's hung :).

>   
>> Is anything happening with IRQ 10?
>>     
> I think IRQ 10 is not used in that machine
>   
Uhmm...

        ipmi_si: Trying SMBIOS-specified bt state machine at i/o
address 0xe4, slave address 0x20, irq 10
          Using irq 10

It seems like the IPMI driver should be using IRQ 10.
Comment 6 Corey Minyard 2008-01-23 06:17:15 UTC
bugme-daemon@bugzilla.kernel.org wrote:
> ------- Comment #4 from wwlinuxengineering@dell.com  2008-01-23 03:23 -------
> As pointed out in Comment#1, we modified the patch ("IPMI: Fix BT long busy")
> by removing the changes in ipmi_bt_sm.c file and applied the rest of the
> patch
> as is.
> After these modifications, ipmi driver loads successfully without any issues.
>   
That is actually not pointed out in comment #1 (probably was in the 
email thread before I saw it), but it is good information.  I guess it's 
time to do some debugging there.

Can you load the module and add "bt_debug=7" to the modprobe/insmod 
line?  It may be time to pull Rocky Craig into the conversation, too.
Comment 7 Dell Linux Engineering 2008-01-24 09:40:08 UTC
Created attachment 14565 [details]
/var/log/messages containing logs when ipmi_si driver was loaded with bt_debug.
Comment 8 Rocky Craig 2008-02-14 13:11:24 UTC
There's several weird things going on (inferring from the log attachment):

1. The upper layer is starting the kernel thread to do polling ("Read event message buffer") even though the hardware looks like it supports interrupts.
2. "Using interrupts" messages are coming out at unexpected times.
3. Something is requesting "Get Device GUID" early in the initialization.  I don't see it in ipmi_si_intf.c, could it be coming from the message handler?  And why is it happening so soon?
4. The Device GUID response is getting mangled somehow, maybe there's some race conditions with interrupt enabling?

Anyhow, the "main" thread doesn't seem to be returning from initialization, even  though the polling thread is merrily chugging away.

I don't actually have a PE 2600 so I'm not sure where to go next.
Comment 9 Rocky Craig 2008-02-14 21:02:15 UTC
Adding this as an attachment generated CGI errors, so here is an abbreviated, mildly annotated version of the debug output.  I'm wondering if the hard failure of the "Get Event Message Buffer" is what's blocking the completion of the initialization phase, and thus hanging the insmod?

38:19 ipmi message handler version 39.1
38:19 ipmi device interface
38:50 IPMI System Interface driver.
38:50 ipmi_si: Trying SMBIOS-specified bt state machine at
                to address 0xe4, slave address 0x20, irq 10

38:50 IPMI BT: flag reset [ SMS ]               BT subdriver entry

38:50 BT: +++++++++++++++++ New command         ipmi_si layer init
38:50 BT: NetFn/LUN CMD [0 data]: 18 01         Get Device ID
38:50 BT: XACTION [ ] TO=5000000 - 0
38:50 BT: WR_BYTES [ ] TO=5000000 - 0
38:50 BT: write 4 bytes seq=0x01 03 18 00 01    seq=xxxx is one-off
38:50 BT: WR_CONSUME [ H2B ] TO=5000000 - 0
38:50 BT: WR_CONSUME [ ] TO=5000000 - 100
38:50 BT: RD_WAIT [ ] TO=4999900 - 0
38:50 BT: RD_WAIT [ ] TO=4999900 - 100
38:50 BT: RD_WAIT [ ] TO=4999800 - 100
38:50 BT: RD_WAIT [ ] TO=4999700 - 100
38:50 BT: RD_WAIT [ ] TO=4999600 - 100
38:50 BT: RD_WAIT [ ] TO=4999500 - 100
38:50 BT: RD_WAIT [ ] TO=4999400 - 100
38:50 BT: RD_WAIT [ ] TO=4999300 - 100
38:50 BT: RD_WAIT [ ] TO=4999200 - 100
38:50 BT: RD_WAIT [ B2H ] TO=4999100 - 100
38:50 BT: CLEAR_B2H [ H_BUSY ] TO=4999000 - 0
38:50 BT: RD_BYTES [ H_BUSY ] TO=4999000 - 0
38:50 BT: got 16 bytes seq=0x00 0f 1c 00 01 00 00 80 01  seq=xxxx is correct
                                84 01 1f a2 02 00 04 00
38:50 BT: result 14 bytes: 1c 01 00 00 80 01 84 01 1f a2 02 00 04 00

38:50   Using irq 10                            odd placement...
38:50 BT: IDLE [ ] TO=4999000 - 0

38:50 BT: CAP_BEGIN [ ] TO=4999000 - 0          Started from first IDLE state

38:50 BT: +++++++++++++++++ New command         Get BT Interface Capabilities
38:50 BT: NetFn/LUN CMD [0 data]: 18 36
38:50 BT: XACTION [ ] TO=5000000 - 10000
38:50 BT: WR_BYTES [ ] TO=4990000 - 0
38:50 BT: write 4 bytes seq=0x02 03 18 01 36
38:51 BT: WR_CONSUME [ H2B ] TO=4990000 - 0
38:51 BT: WR_CONSUME [ ] TO=4990000 - 11000
38:51 BT: RD_WAIT [ ] TO=4979000 - 0
38:51 BT: RD_WAIT [ B2H ] TO=4979000 - 11000
38:51 BT: CLEAR_B2H [ H_BUSY ] TO=4968000 - 0
38:51 BT: RD_BYTES [ H_BUSY ] TO=4968000 - 0
38:51 BT: got 10 bytes seq=0x01 09 1c 01 36 00 01 40 40 0a 03
38:51 BT: CAP_END [ ] TO=4968000 - 0
38:51 BT: result 8 bytes: 1c 36 00 01 40 40 0a 03
38:51 IPMI BT: req2rsp=10 secs retries=3                IO buffers = 0x40 bytes
38:51 BT: IDLE [ ] TO=10000000 - 0

38:51 BT: +++++++++++++++++ New command                 WHO DID THIS?
38:51 BT: NetFn/LUN CMD [0 data]: 18 08                 Get Device GUID
38:51 BT: XACTION [ ] TO=10000000 - 0
38:51 BT: WR_BYTES [ ] TO=10000000 - 0
38:51 BT: write 4 bytes seq=0x01 03 18 00 08
38:51 BT: WR_CONSUME [ H2B ] TO=10000000 - 0
38:51 BT: WR_CONSUME [ ] TO=10000000 - 11000
38:51 BT: RD_WAIT [ ] TO=9989000 - 0
38:51 BT: RD_WAIT [ B2H ] TO=9989000 - 11000
38:51 BT: CLEAR_B2H [ H_BUSY ] TO=9978000 - 0
38:51 BT: RD_BYTES [ H_BUSY ] TO=9978000 - 0
38:51 BT: got 21 bytes seq=0x00 14 1c 00 08 00 00 00 00 00
                                00 00 00 00 00 00 00 ...

        cCode == 0, looks like like 16 bytes of zeros

38:51 BT: result 3 bytes: 1c 08 c6      ERROR/CORRUPTION COMES IN HERE???
38:51 ipmi_si: Error clearing flags: c6

I've seen this kind of thing before, the c6 is leftover from some other
operation when interrupts were not timed right...can't quite remember it.

38:51 BT: +++++++++++++++++ New command         ipmi_si layer init
38:51 BT: NetFn/LUN CMD [0 data]: 18 2f         Get BMC Global Enables
38:51 BT: XACTION [ ] TO=10000000 - 0
38:51 BT: WR_BYTES [ ] TO=10000000 - 11000
38:51 BT: write 4 bytes seq=0x02 03 18 01 2f
38:51 BT: WR_CONSUME [ H2B ] TO=9989000 - 0
38:51 BT: WR_CONSUME [ B2H ] TO=9989000 - 11000
38:51 BT: RD_WAIT [ B2H ] TO=9978000 - 0
38:51 BT: CLEAR_B2H [ H_BUSY ] TO=9978000 - 0
38:51 BT: RD_BYTES [ H_BUSY ] TO=9978000 - 0
38:51 BT: got 6 bytes seq=0x01 05 1c 01 2f 00 09
38:51 BT: result 4 bytes: 1c 2f 00 09

38:51 BT: +++++++++++++++++ New command         ipmi_si layer init
38:51 BT: NetFn/LUN CMD [1 data]: 18 2e 09      Set BMC Global Enables
38:51 BT: XACTION [ ] TO=10000000 - 0
38:51 BT: WR_BYTES [ ] TO=10000000 - 11000
38:52 BT: write 5 bytes seq=0x03 04 18 02 2e 09
38:52 BT: WR_CONSUME [ H2B ] TO=9989000 - 0
38:52 BT: WR_CONSUME [ B2H ] TO=9989000 - 11000
38:52 BT: RD_WAIT [ B2H ] TO=9978000 - 0
38:52 BT: CLEAR_B2H [ H_BUSY ] TO=9978000 - 0
38:52 BT: RD_BYTES [ H_BUSY ] TO=9978000 - 0
38:52 BT: got 5 bytes seq=0x02 04 1c 02 2e 00
38:52 BT: result 3 bytes: 1c 2e 00
38:52 BT: IDLE [ ] TO=9978000 - 0
38:52 BT: IDLE [ ] TO=10000000 - 11000
38:52 last message repeated 28 times
38:52 BT: IDLE [ ] TO=10000000 - 10000
38:52 BT: IDLE [ ] TO=10000000 - 11000
38:52 last message repeated 34 times

38:52 BT: +++++++++++++++++ New command         ipmi_si layer polling thread?
38:52 BT: NetFn/LUN CMD [0 data]: 18 35         Read Event Message buffer
38:52 BT: XACTION [ ] TO=10000000 - 0
38:52 BT: WR_BYTES [ ] TO=10000000 - 0
38:52 BT: write 4 bytes seq=0x04 03 18 03 35
38:52 BT: WR_CONSUME [ H2B ] TO=10000000 - 0
38:52 BT: WR_CONSUME [ B2H ] TO=10000000 - 11000
38:52 BT: RD_WAIT [ B2H ] TO=9989000 - 0
38:52 BT: CLEAR_B2H [ H_BUSY ] TO=9989000 - 0
38:52 BT: RD_BYTES [ H_BUSY ] TO=9989000 - 0
38:52 BT: got 5 bytes seq=0x03 04 1c 03 35 ce   cCode == CE: Response could
38:52 BT: result 3 bytes: 1c 35 ce                           not be provided
38:52 BT: IDLE [ ] TO=9989000 - 0
38:52 BT: IDLE [ ] TO=10000000 - 11000
38:52 last message repeated 53 times
38:52 BT: IDLE [ ] TO=10000000 - 12000
38:52 BT: IDLE [ ] TO=10000000 - 10000
38:52 BT: IDLE [ ] TO=10000000 - 11000
38:53 last message repeated 33 times

38:53 BT: +++++++++++++++++ New command         Ad infinitum
38:53 BT: NetFn/LUN CMD [0 data]: 18 35         Read Event Message buffer
38:53 BT: XACTION [ ] TO=10000000 - 0
38:53 BT: WR_BYTES [ ] TO=10000000 - 0
38:53 BT: write 4 bytes seq=0x05 03 18 04 35
38:53 BT: WR_CONSUME [ H2B ] TO=10000000 - 0
38:53 BT: WR_CONSUME [ B2H ] TO=10000000 - 11000
38:53 BT: RD_WAIT [ B2H ] TO=9989000 - 0
38:53 BT: CLEAR_B2H [ H_BUSY ] TO=9989000 - 0
38:53 BT: RD_BYTES [ H_BUSY ] TO=9989000 - 0
38:53 BT: got 5 bytes seq=0x04 04 1c 04 35 ce
38:53 BT: result 3 bytes: 1c 35 ce
 :
 :
 :
Comment 10 Kyle 2008-04-11 05:37:28 UTC
If I may, I'd like to add a strange finding that may help in finding the cause of this issue or at least may be slightly related. I found that the ipmi module loaded fine in CentOS 5 with kernel 2.6.18-53 with a 2 year old RAID firmware. After we upgraded the firmware to the latest Critical update for the PE2600 in question the ipmi module hangs in a continuous loop. I understand that this bug was reported for kernels >=2.6.20 but this is the second PE2600 we've had this problem on although we thought we were good to go until we upgraded the RAID firmware for the PERC 4/Di.
Comment 11 Weydon 2008-07-01 02:12:57 UTC
ipmi message handler version 39.1
IPMI System Interface driver.
ipmi_si: Trying SMBIOS-specified bt state machine at i/o address 0xe4, slave address 0x20, irq 15
  Using irq 15
IPMI BT: req2rsp=10 secs retries=3
ipmi_si: Error clearing flags: c6
ipmi device interface

CentOS 5 2.6.18-53.1.6.el5 #1 SMP Wed Jan 23 11:30:20 EST 2008 i686 i686 i386 GNU/Linux
My PE6650 server got same problem hangs forever.
Comment 12 Dell Linux Engineering 2008-07-02 00:37:56 UTC
The following patch fixes the issue:

diff -purN base/ipmi_si_intf.c rhel51/ipmi_si_intf.c
--- base/ipmi_si_intf.c	2008-04-01 14:08:34.000000000 -0400
+++ rhel51/ipmi_si_intf.c	2008-04-01 14:00:46.000000000 -0400
@@ -947,10 +947,6 @@ static int smi_start_processing(void    
 
 	new_smi->intf = intf;
 
-	/* Try to claim any interrupts. */
-	if (new_smi->irq_setup)
-		new_smi->irq_setup(new_smi);
-
 	/* Set up the timer that drives the interface. */
 	setup_timer(&new_smi->si_timer, smi_timeout, (long)new_smi);
 	new_smi->last_timeout_jiffies = jiffies;
@@ -2620,6 +2616,10 @@ static int try_smi_init(struct smi_info 
 	setup_oem_data_handler(new_smi);
 	setup_xaction_handlers(new_smi);
 
+	/* Try to claim any interrupts. */
+	if (new_smi->irq_setup)
+		new_smi->irq_setup(new_smi);
+
 	INIT_LIST_HEAD(&(new_smi->xmit_msgs));
 	INIT_LIST_HEAD(&(new_smi->hp_xmit_msgs));
 	new_smi->curr_msg = NULL;


This patch is available as a part of  2.6.24 kernel upsteam and has been backported to RHEL 5.2(kernel 2.6.18-92).

Note You need to log in before you can comment on or make changes to this bug.