Bug 14101

Summary: acpi_idle_enter_bm() calls sched_clock_cpu() with interrupts enabled - i8kmon - Vostro1710
Product: Drivers Reporter: Ben Hutchings (ben)
Component: PlatformAssignee: acpi_platform-drivers (acpi_platform-drivers)
Status: REJECTED DOCUMENTED    
Severity: normal CC: akpm, lenb, tim.w.connors
Priority: P1    
Hardware: All   
OS: Linux   
URL: http://bugs.debian.org/541677
Kernel Version: 2.6.30-5 (Debian) Subsystem:
Regression: Yes Bisected commit-id:
Attachments: acpidump output
lspci output
/proc/cpuinfo
disable the SMI to see whether the issue still happens
try the debug patch, which adds the debug macro to identify where the irq is re-enabled?

Description Ben Hutchings 2009-08-31 20:43:06 UTC
acpi_idle_enter_bm() may call sched_clock_idle_sleep_event(), which calls sched_clock_cpu(), which asserts:

	WARN_ON_ONCE(!irqs_disabled());

Here's a full trace from <http://bugs.debian.org/541677>:

> [  391.549555] ------------[ cut here ]------------
> [  391.549574] WARNING: at
> /build/mattems-linux-2.6_2.6.30-5-amd64-_9JoL1/linux-2.6-2.6.30-5/debian/build/source_amd64_none/kernel/sched_clock.c:170
> sched_clock_cpu+0x7b/0x202()
> [  391.549582] Hardware name: Vostro1710
> [  391.549586] Modules linked in: i915 drm i2c_algo_bit xt_multiport
> binfmt_misc ppdev parport_pc lp parport sco bridge stp bnep rfcomm l2cap
> cachefiles autofs4 acpi_cpufreq cpufreq_userspace cpufreq_stats
> cpufreq_conservative cpufreq_powersave iptable_mangle iptable_nat
> iptable_filter ip_tables ipt_MASQUERADE nf_nat nf_conntrack_ipv4
> nf_defrag_ipv4 x_tables nf_conntrack microcode nfsd exportfs nfs lockd
> fscache nfs_acl auth_rpcgss sunrpc bonding fuse coretemp i8k loop
> snd_hda_codec_realtek arc4 ecb snd_hda_intel snd_hda_codec iwlagn snd_hwdep
> snd_pcm_oss snd_mixer_oss iwlcore snd_pcm snd_seq_midi snd_rawmidi
> snd_seq_midi_event snd_seq uvcvideo snd_timer snd_seq_device dell_laptop
> videodev mac80211 i2c_i801 joydev rfkill snd serio_raw v4l1_compat soundcore
> btusb snd_page_alloc dcdbas v4l2_compat_ioctl32 i2c_core bluetooth pcspkr
> cfg80211 psmouse evdev battery processor wmi button ac ext3 jbd mbcache
> dm_mirror dm_region_hash dm_log dm_snapshot dm_mod raid10 raid456 raid6_pq
> async_x
> r async_memcpy async_tx xor raid1 raid0 multipath linear md_mod sd_mod
> crc_t10dif ide_cd_mod cdrom ata_generic ide_pci_generic ahci video output
> sdhci_pci sdhci mmc_core led_class r8169 mii libata uhci_hcd piix scsi_mod
> ide_core ehci_hcd intel_agp thermal fan thermal_sys
> [  391.549836] Pid: 0, comm: swapper Not tainted 2.6.30-1-amd64 #1
> [  391.549841] Call Trace:
> [  391.549852]  [<ffffffff80258897>] ? sched_clock_cpu+0x7b/0x202
> [  391.549861]  [<ffffffff80258897>] ? sched_clock_cpu+0x7b/0x202
> [  391.549871]  [<ffffffff8024238f>] ? warn_slowpath_common+0x77/0xa3
> [  391.549880]  [<ffffffff80258897>] ? sched_clock_cpu+0x7b/0x202
> [  391.549905]  [<ffffffffa020a4db>] ? acpi_idle_enter_bm+0x16c/0x2a8
> [processor]
> [  391.549916]  [<ffffffff803fcfe0>] ? cpuidle_idle_call+0x8b/0xc7
> [  391.549926]  [<ffffffff8064e140>] ? early_idt_handler+0x0/0x71
> [  391.549936]  [<ffffffff8020edda>] ? cpu_idle+0x50/0x91
> [  391.549943]  [<ffffffff8064ec62>] ? start_kernel+0x37a/0x386
> [  391.549951]  [<ffffffff8064e3b7>] ? x86_64_start_kernel+0xf9/0x106
> [  391.549957] ---[ end trace 1aa56acdd82f3805 ]---
Comment 1 Len Brown 2009-09-01 02:06:19 UTC
please supply the output from lspci -v
Comment 2 Ben Hutchings 2009-09-01 02:10:07 UTC
(In reply to comment #1)
> please supply the output from lspci -v

I can ask the original reporter, but isn't code inspection sufficient?
Comment 3 ykzhao 2009-09-02 00:47:09 UTC
Hi, Ben
    Will you please attach the output of acpidump, dmesg, cat /proc/cpuinfo?
    Thanks.
Comment 4 Ben Hutchings 2009-09-02 01:41:46 UTC
(In reply to comment #3)
> Hi, Ben
>     Will you please attach the output of acpidump, dmesg, cat /proc/cpuinfo?
>     Thanks.

dmesg output is in the Debian bug report. I will ask for the others.
Comment 5 Ben Hutchings 2009-09-02 21:41:53 UTC
Created attachment 22980 [details]
acpidump output
Comment 6 Ben Hutchings 2009-09-02 21:42:15 UTC
Created attachment 22981 [details]
lspci output
Comment 7 Ben Hutchings 2009-09-02 21:42:34 UTC
Created attachment 22982 [details]
/proc/cpuinfo
Comment 8 Andrew Morton 2009-09-03 00:18:28 UTC
Guys, we shouldn't need lspci and acpidump output to find out what this bug is.

But I don't get it.  acpi_idle_enter_bm() does local_irq_disable() prior to calling sched_clock_idle_sleep_event(), so why did the warning trigger?
Comment 9 Ben Hutchings 2009-09-03 00:45:37 UTC
(In reply to comment #8)
> Guys, we shouldn't need lspci and acpidump output to find out what this bug
> is.
> 
> But I don't get it.  acpi_idle_enter_bm() does local_irq_disable() prior to
> calling sched_clock_idle_sleep_event(), so why did the warning trigger?

Yes, right, I misread the sense of the warning, so the summary is wrong.

No idea what's happened here. If you refer back to the original Debian bug report, you'll see some other symptoms reported which might point at a deeper problem (hardware or SMM misbehaviour) or might be cascading faults.
Comment 10 ykzhao 2009-09-07 02:04:24 UTC
After checking the source code, it seems that the irq will be disabled before calling the function of sched_clock_cpu.
Why is the irq enabled spontaneously? Maybe it is caused by SMM misbehaviour.


Thanks.
Comment 11 Len Brown 2009-09-09 02:10:14 UTC
lspci shows that this is an 82801H (ICH8 Family)

Yakui, maybe a test patch to flip the chipset bit that disables SMI/SMM
will prove that the instability is due to SMM or not?
Comment 12 ykzhao 2009-09-11 02:11:33 UTC
Created attachment 23059 [details]
disable the SMI to see whether the issue still happens

Will you please try the debug patch on the latest kernel and see whether the issue still happens?

This debug patch is only to check whether the issue is caused by the SMI. In this patch the SMI is disabled after ACPI is enabled.

After the test, please attach the output of dmesg.

Thanks.
Comment 13 Len Brown 2009-11-03 02:32:58 UTC
http://kerneloops.org/searchweek.php?search=acpi_idle_enter_bm
shows that a lot (of apparently 32-bit machines) are running into this.
dunno if they are all ICH8's

Yakui, please add a 2nd debug patch with a WARN_ONCE in acpi_idle_enter_bm
to figure out exactly where we first have interrupts disabled on us.
I'll bet it is the actual entry to c3, but who knows...
Comment 14 ykzhao 2009-11-06 05:42:28 UTC
Created attachment 23673 [details]
try the debug patch, which adds the debug macro to identify where the irq is re-enabled?
Comment 15 ykzhao 2009-11-06 05:47:33 UTC
This issue is different with the kerneloops in kerneloops.org.
It seems that the issue in the kerneloops.org is caused by that condition_schedule is called in ISR/softirq after the irq is enabled. (When the cpu exits the C-state, it will be re-enable the interrupt. And then OS will process the ISR/softirq).

But this issue in this bug is that the irq is re-enabled spontaneously before entering the deep C-state.

Hi, Ben
   Will you please try the debug patch in comment #14 and attach the output of dmesg when the issue happens?
 
Thanks.
Comment 16 Tim Connors 2009-11-08 05:10:48 UTC
I'm the original reporter, and am now subscribed to this report.

The laptop is currently in "production", so I can't apply the patch until I can afford to reboot and crash the machine, however, one bit of trivia that didn't get into this report is that the bug is related to the i8k module.  With i8kmon running (which polls /proc/i8k I believe), the machine crashes in a small number of minutes (it stutters and pauses for a few seconds at a time and gets worse and worse until it panics).

If on the console, I cat /proc/i8kmon in a tight bash loop, then I could induce the panic in a few seconds after a few stutters, and if I recall correctly, after seeing a couple of traces (can't remember whether they were BUGs or otherwise).  The traces almost always involved "coretemp" if I recall correctly.

The hardware isn't an inspiron 8000, but that module had worked on most dell laptops I had come across, and was giving sane temperatures (if not fan speed) and was able to control the fan, so I kept on using it until I worked out it was crashing 2.6.28 and above.

If the debug will still be useful knowing these points, I'll get back to you when I'm able to apply the patch.
Comment 17 ykzhao 2009-11-13 03:40:20 UTC
Do you mean that the i8k module is used on your box?
   It seems that it will enter SMM mode in i8k. If the i8kmon polls the /proc/i8k interface, it will frequently enter the SMM mode. Maybe the IRQ will be enabled after entering the SMM. At the same time as the SMM is invisible to OS, we can't know where/when the IRQ is enabled.
   If the IRQ is enabled before we enter the C-state, sometimes it will complain the following warning message:

     ------------[ cut here ]------------
>> [  391.549836] Pid: 0, comm: swapper Not tainted 2.6.30-1-amd64 #1
> [  391.549841] Call Trace:
> [  391.549852]  [<ffffffff80258897>] ? sched_clock_cpu+0x7b/0x202
> [  391.549861]  [<ffffffff80258897>] ? sched_clock_cpu+0x7b/0x202
> [  391.549871]  [<ffffffff8024238f>] ? warn_slowpath_common+0x77/0xa3
> [  391.549880]  [<ffffffff80258897>] ? sched_clock_cpu+0x7b/0x202
> [  391.549905]  [<ffffffffa020a4db>] ? acpi_idle_enter_bm+0x16c/0x2a8
> [processor]
> [  391.549916]  [<ffffffff803fcfe0>] ? cpuidle_idle_call+0x8b/0xc7
> [  391.549926]  [<ffffffff8064e140>] ? early_idt_handler+0x0/0x71
> [  391.549936]  [<ffffffff8020edda>] ? cpu_idle+0x50/0x91
> [  391.549943]  [<ffffffff8064ec62>] ? start_kernel+0x37a/0x386
> [  391.549951]  [<ffffffff8064e3b7>] ? x86_64_start_kernel+0xf9/0x106
> [  391.549957] ---[ end trace 1aa56acdd82f3805 ]---

How about this issue if you don't load the i8k module or try the patch in comment #12?

thanks.
Comment 18 ykzhao 2009-12-22 01:43:09 UTC
From the log it seems that this issue is related with the SMI operation. In such case it is transparent to OS and we don't know when the irq is enabled.
At the same time as there is no response for more than one month, the bug will be rejected.