Bug 5987 - hda: cdrom_pc_intr: The drive appears confused - ICH7: 100% native mode - irq 209: nobody cared - ASUS P5WD2-Premium
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: i386
Hardware: i386 Linux
Importance: P2 normal
Assignee: platform_i386
Reported: 2006-01-31 19:35 UTC by Alex Unigovsky
Modified: 2009-07-20 14:15 UTC

See Also:
Kernel Version: 2.6.16-rc1-mm3
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Full dmesg log from boot-up to running state. (50.79 KB, text/plain)
2006-01-31 19:41 UTC, Alex Unigovsky
cat /proc/interrupts (918 bytes, text/plain)
2006-01-31 19:42 UTC, Alex Unigovsky
Output of dmidecode 2.7 (21.06 KB, text/plain)
2006-01-31 19:45 UTC, Alex Unigovsky
Output of acpidump (155.75 KB, text/plain)
2006-01-31 19:50 UTC, Alex Unigovsky
Full dmesg with CONFIG_PCI_MSI=n and acpi=off on boot. (24.63 KB, text/plain)
2006-02-01 21:59 UTC, Alex Unigovsky
/proc/interrupts with CONFIG_PCI_MSI=n and acpi=off on boot. (791 bytes, text/plain)
2006-02-01 22:00 UTC, Alex Unigovsky
lspci -vv with CONFIG_PCI_MSI=n and acpi=off on boot. (18.01 KB, text/plain)
2006-02-01 22:00 UTC, Alex Unigovsky
Combined dmesg of 2.6.16-rc1-mm5 (60.97 KB, text/plain)
2006-02-04 20:53 UTC, Alex Unigovsky
Bootup dmesg output (39 bytes, text/plain)
2007-08-07 01:45 UTC, H
Bootup dmesg output (24.00 KB, text/plain)
2007-08-07 01:49 UTC, H

Description Alex Unigovsky 2006-01-31 19:35:35 UTC
Pre Scriptum: Sorry if the category is totally wrong, but I can't easily
categorize this problem.

Most recent kernel where this bug did not occur:

Cannot say exactly, because I was plagued by various ACPI- and SATA-related bugs
since the switch to generic PCI routines on i386 (it was in 2.6.14, IIRC). The
2.6.16-rc1 version was the first one that allowed me to boot without irqpoll,
but the problem remained.

Distribution:

Gentoo Linux

Hardware Environment:

ASUS P5WD2-Premium motherboard (ICH7 for IDE and SATA storage, other controllers
unused);
/* My previous MB got struck by 220V current, so I couldn't compare :) */
Pentium D 3GHz;
1GB RAM;
Marvell 88E8001 [sk98lin] and Intel 82573V [e1000] NICs, both onboard;
nVidia GeForce FX5900 on PCIe16x slot;
Creative SB Audigy2 ZS;
3 SATA hard drives connected to ICH7;
1 ATAPI combo-drive connected to ICH7 as primary master;

Software Environment:

ACPI 2.0 tables and APIC enabled in M/B BIOS;
GRUB 0.97;
Kernel 2.6.16-rc1-mm3 (also tried mm2, mm1, and various 2.6.15 mm's);
No distro or external patches except SquashFS and CDFS;
GCC 4.0.2 with ordinary Gentoo patchset;
Using udev 084;
On ext2, ext3, reiserfs and reiser4 partitions;

Problem Description:

During ACPI init, I can see this (maybe irrelevant) message:
-- -- CUT HERE -- --
ACPI: OEMB (v001 A M I  AMI_OEM  0x10000530 MSFT 0x00000097) @ 0x3ff8e040
  >>> ERROR: Invalid checksum
ACPI: MCFG (v001 A M I  OEMMCFG  0x10000530 MSFT 0x00000097) @ 0x3ff887c0
-- -- CUT HERE -- --

Later on, after the kernel loads, and initscripts start to load modules, the
next (more relevant, IMHO) thing appears:
-- -- CUT HERE -- --
drivers/usb/serial/usb-serial.c: USB Serial Driver core
irq 209: nobody cared (try booting with the "irqpoll" option)
 <c103fbf4> __report_bad_irq+0x24/0x7f   <c103fcd0> note_interrupt+0x81/0x231
 <c100ecda> mark_offset_pmtmr+0x9e/0x10d   <c103f6cd> handle_IRQ_event+0x2e/0x5a
 <c103f7d6> __do_IRQ+0xdd/0xe7   <c100515c> do_IRQ+0x3a/0x52
 =======================
 <c1003652> common_interrupt+0x1a/0x20
handlers:
[<c11c6585>] (ide_intr+0x0/0x1f5)
Disabling IRQ #209
drivers/usb/serial/usb-serial.c: USB Serial support registered for Handspring 
-- -- CUT HERE -- --

IRQ209 is almost always set up (by APIC, I suppose?) to be shared between
ide0 (ICH7 IDE) and EMU10K1 (Audigy using ALSA). Once I saw this backtrace
coming from within some sk98lin function, and that time IRQ209 was on the sk98lin
NIC (maybe, I don't remember, and I could not reproduce that).

After that backtrace was printed, the PC finished loading modules and proceeded to
the runlevel scripts, started some programs, and after 10 seconds or so I saw this:
-- -- CUT HERE -- --
input: Bluetooth HID Boot Protocol Device as /class/input/input7
hda: lost interrupt
hda: cdrom_pc_intr: The drive appears confused (ireason = 0x01)
[numerous repeating lines cut out]
hda: cdrom_pc_intr: The drive appears confused (ireason = 0x01)
hda: packet command error: status=0xd0 { Busy }
ide: failed opcode was: unknown
hda: cdrom_pc_intr: The drive appears confused (ireason = 0x01)
irq 209: nobody cared (try booting with the "irqpoll" option)
 <c103fbf4> __report_bad_irq+0x24/0x7f   <c103fcd0> note_interrupt+0x81/0x231
 <c100ecda> mark_offset_pmtmr+0x9e/0x10d   <c103f6cd> handle_IRQ_event+0x2e/0x5a
 <c103f7d6> __do_IRQ+0xdd/0xe7   <c100515c> do_IRQ+0x3a/0x52
 =======================
 <c1003652> common_interrupt+0x1a/0x20   <c1001a97> mwait_idle+0x2a/0x34
 <c1001a50> cpu_idle+0x61/0x7e   <c13ca4e1> start_kernel+0x31f/0x3f2
 <c13ca5b4> unknown_bootoption+0x0/0x24c
handlers:
[<c11c6585>] (ide_intr+0x0/0x1f5)
[<f8b8af00>] (snd_emu10k1_interrupt+0x0/0x450 [snd_emu10k1])
Disabling IRQ #209
hda: lost interrupt
hda: cdrom_pc_intr: The drive appears confused (ireason = 0x01)
hda: cdrom_pc_intr: The drive appears confused (ireason = 0x01)
hda: cdrom_pc_intr: The drive appears confused (ireason = 0x01)
hda: cdrom_pc_intr: The drive appears confused (ireason = 0x01)
hda: cdrom_pc_intr: The drive appears confused (ireason = 0x01)
irq 209: nobody cared (try booting with the "irqpoll" option)
 <c103fbf4> __report_bad_irq+0x24/0x7f   <c103fcd0> note_interrupt+0x81/0x231
 <c103f6cd> handle_IRQ_event+0x2e/0x5a   <c103f7d6> __do_IRQ+0xdd/0xe7
 <c100515c> do_IRQ+0x3a/0x52
 =======================
 <c1003652> common_interrupt+0x1a/0x20   <c1001a97> mwait_idle+0x2a/0x34
 <c1001a50> cpu_idle+0x61/0x7e   <c13ca4e1> start_kernel+0x31f/0x3f2
 <c13ca5b4> unknown_bootoption+0x0/0x24c
handlers:
[<c11c6585>] (ide_intr+0x0/0x1f5)
[<f8b8af00>] (snd_emu10k1_interrupt+0x0/0x450 [snd_emu10k1])
Disabling IRQ #209
-- -- CUT HERE -- --

The funny thing is that although my CD-ROM stops working, I can still use the
sound card without any problems or glitches. I haven't tried irqpoll yet, but
previous experience tells me that it will only hide stuff from dmesg and not fix
the issue.

As for occasional lockups, I can not say anything right now. It can be related,
or it can be my own fault :)

Full dmesg and /proc/interrupts will follow shortly.

Steps to reproduce:

Compile and install any m#2\.6\.1[56]-mm\d+# kernel. Boot it. See oopses passing
by. Grep dmesg. Think a bit. File a bug :)

Thanks in advance,
Unik.
Comment 1 Alex Unigovsky 2006-01-31 19:41:13 UTC
Created attachment 7193
Full dmesg log from boot-up to running state.

Produced by combining on-boot-saved dmesg with current one.
Comment 2 Alex Unigovsky 2006-01-31 19:42:57 UTC
Created attachment 7194
cat /proc/interrupts

See IRQ209.
Comment 3 Alex Unigovsky 2006-01-31 19:45:36 UTC
Created attachment 7195
Output of dmidecode 2.7
Comment 4 Alex Unigovsky 2006-01-31 19:50:45 UTC
Created attachment 7196
Output of acpidump

I had a warning "Wrong checksum for generic table!" just before OEMB table.
That's probably the same as the checksum message in dmesg.
Comment 5 Alex Unigovsky 2006-01-31 19:58:11 UTC
Oh, and one thing I forgot: the M/B BIOS version is 0606, the latest one
released by ASUS. By reading their support site I can tell that this is no beta,
so the checksum error is strange, considering the BIOS image passed the CRC check
at the time of flashing.
Comment 6 Andrew Morton 2006-01-31 20:34:01 UTC

Begin forwarded message:

Date: Tue, 31 Jan 2006 23:29:20 -0500
From: "Brown, Len" <len.brown@intel.com>
To: "Andrew Morton" <akpm@osdl.org>, "Jeff Garzik" <jgarzik@pobox.com>, "Bartlomiej Zolnierkiewicz" <B.Zolnierkiewicz@elka.pw.edu.pl>
Subject: RE: [Bugme-new] [Bug 5987] New: Oopses at boot, inability to use IDE CD-ROM drive and a lockup once-a-day.


> ICH7: 100% native mode on irq 209 

Has 100% native mode *ever* worked on *any* system?

This BIOS clearly has some ACPI related issues,
but it isn't immediately clear that they're the cause
of the failure.

-Len

Comment 7 Len Brown 2006-01-31 20:34:50 UTC
please boot with "acpi=off", attach the dmesg -s64000 output and paste
a copy of /proc/interrupts.  Please also attach the output from lspci -vv
Comment 8 Andrew Morton 2006-01-31 21:14:39 UTC

Begin forwarded message:

Date: Tue, 31 Jan 2006 23:56:58 -0500
From: Jeff Garzik <jgarzik@pobox.com>
To: "Brown, Len" <len.brown@intel.com>
Cc: Andrew Morton <akpm@osdl.org>, Bartlomiej Zolnierkiewicz <B.Zolnierkiewicz@elka.pw.edu.pl>
Subject: Re: [Bugme-new] [Bug 5987] New: Oopses at boot, inability to use IDE CD-ROM drive and a lockup once-a-day.


Brown, Len wrote:
>>ICH7: 100% native mode on irq 209 
> 
> 
> Has 100% native mode *ever* worked on *any* system?

For libata, definitely.  I could have sworn it worked in IDE driver too...

	Jeff

Comment 9 Len Brown 2006-01-31 21:56:30 UTC
> ACPI: OEMB (v001 A M I  AMI_OEM  0x10000530 MSFT 0x00000097) @ 0x3ff8e040
>  >>> ERROR: Invalid checksum

Evidence of a shoddy BIOS, but not related to the failure at hand.

The acpidump output shows that this table claims a length of 102 bytes.
But the checksum across 102 bytes is non-zero.  It is possible that the
BIOS writer got the checksum right but the length wrong, as the
checksum after 46 bytes is zero.  Perhaps some buggy proprietary OS
recognizes the OEM-specific "OEMB" as a fixed length structure
of 46 bytes and errantly lets this BIOS through its test suite...

Linux ignores any table with a bad checksum.  But as Linux doesn't
recognize an OEMB table, it would ignore it even if the checksum were valid.

Whatever this table is, it is non-volatile:
 BIOS-e820: 000000003ff8e000 - 000000003ffe0000 (ACPI NVS)
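
The length-versus-checksum reasoning above can be checked against a raw table dump with a short script. This is a minimal sketch: it relies on the ACPI rule that all bytes of a table must sum to zero modulo 256 and that the 32-bit little-endian Length field sits at offset 4 of the 36-byte table header. The function names are illustrative, and how the raw bytes of the OEMB table are extracted from the acpidump output is left out.

```python
import struct

ACPI_HEADER_SIZE = 36  # size of the standard ACPI table header in bytes

def acpi_sum(data, length):
    """Sum the first `length` bytes modulo 256; a valid table sums to 0."""
    return sum(data[:length]) & 0xFF

def declared_length(table):
    """Read the 32-bit little-endian Length field at offset 4 of the header."""
    return struct.unpack_from('<I', table, 4)[0]

def zero_sum_lengths(table):
    """List every prefix length, from the header size up, whose byte sum is 0 mod 256."""
    total = 0
    hits = []
    for i, byte in enumerate(table, start=1):
        total = (total + byte) & 0xFF
        if i >= ACPI_HEADER_SIZE and total == 0:
            hits.append(i)
    return hits
```

If `acpi_sum(table, declared_length(table))` is non-zero while a shorter length appears in `zero_sum_lengths(table)`, that matches the "checksum right, length wrong" hypothesis; note that a zero sum at some shorter length can also occur by chance.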
Comment 10 Len Brown 2006-01-31 22:15:27 UTC
> ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
> ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
> ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
> ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)

Another sign of a shoddy BIOS -- duplicate entries in the MADT.
I believe that Linux should survive this -- programming
these IRQs twice.

Try this:
# /etc/init.d/acpid stop
press the power button a few times and see if the
acpi entry on IRQ9 in /proc/interrupts increments appropriately.
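
A minimal sketch for comparing two saved copies of /proc/interrupts (one taken before and one after pressing the power button) instead of eyeballing the counters. The function names are illustrative, and the parser assumes the usual layout: a header row naming the CPUs, then one row per interrupt with one counter column per CPU.

```python
def parse_interrupts(text):
    """Map each interrupt label to its count summed over all CPU columns."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())      # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(':')
        if not sep:
            continue                   # not an interrupt row
        per_cpu = []
        for field in rest.split()[:ncpus]:
            if not field.isdigit():
                break                  # reached the controller/device columns
            per_cpu.append(int(field))
        counts[label.strip()] = sum(per_cpu)
    return counts

def count_delta(before, after, irq):
    """Number of interrupts counted on `irq` between the two snapshots."""
    return parse_interrupts(after)[irq] - parse_interrupts(before)[irq]
```

With the two snapshots read into strings, `count_delta(before, after, '9')` should equal the number of button presses if the acpi entry on IRQ9 increments as expected.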

Comment 11 Len Brown 2006-01-31 22:19:51 UTC
To simplify matters, please reproduce this failure
with CONFIG_PCI_MSI=n, and with "nvidia" excluded
from the kernel.
Comment 12 Len Brown 2006-01-31 22:54:01 UTC
possibly a duplicate of bug 5084
Comment 13 Alex Unigovsky 2006-02-01 21:59:36 UTC
Created attachment 7215
Full dmesg with CONFIG_PCI_MSI=n and acpi=off on boot.
Comment 14 Alex Unigovsky 2006-02-01 22:00:22 UTC
Created attachment 7216
/proc/interrupts with CONFIG_PCI_MSI=n and acpi=off on boot.
Comment 15 Alex Unigovsky 2006-02-01 22:00:55 UTC
Created attachment 7217
lspci -vv with CONFIG_PCI_MSI=n and acpi=off on boot.
Comment 16 Alex Unigovsky 2006-02-01 22:11:12 UTC
As requested, I uploaded some logs from a kernel compiled without CONFIG_PCI_MSI,
with the acpi=off boot parameter and without the nvidia module. I also have the same set
of logs/files but without acpi=off. I'll upload them if you need them.

The problem can be reproduced in all 3 cases, with IRQ numbers and backtraces
changing a bit. In the case of acpi=off, I noticed that there are fewer "The drive
appears confused" messages.

mount /mnt/cdrom always freezes and cannot be killed even with killall -9.

One other thing: IDE in BIOS is set to "enhanced mode". Is it related to "100%
native mode" as written in dmesg? The bad thing is that I cannot switch it,
because I need 3xSATA + 1xIDE devices, and "compat. mode" only allows a 2-2 split.
Comment 17 Alex Unigovsky 2006-02-01 22:24:21 UTC
Pressing power button without acpid increments IRQ counter by 1 each time on
CPU0 in /proc/interrupts.
Comment 18 Alex Unigovsky 2006-02-04 20:53:43 UTC
Created attachment 7246
Combined dmesg of 2.6.16-rc1-mm5

Kernel version: 2.6.16-rc1-mm5
Without proprietary modules (nvidia).
With CONFIG_PCI_MSI.
With ACPI debug.
Comment 19 Natalie Protasevich 2007-07-08 16:59:44 UTC
Any updates on this problem?
Thanks.
Comment 20 H 2007-08-07 01:16:44 UTC
I have what appears to be a very similar problem, verified on an Asus P5W DH Deluxe (ICH7 SATA/PATA) with kernel version 2.6.22.1.

Experienced problem: the DVD drive and sound start acting up at a random point after the system has been working for a while

Kernel version: Linux 2.6.22.1
Hardware: DVD drive is on an ICH7-based PATA controller, sound card is an Audigy 2 ZS


From dmesg:
hdb: cdrom_pc_intr: The drive appears confused (ireason = 0x01). Trying to recover by ending request.
hdb: cdrom_pc_intr: The drive appears confused (ireason = 0x01). Trying to recover by ending request.
irq 23: nobody cared (try booting with the "irqpoll" option)
 [<c014be9a>] __report_bad_irq+0x36/0x75
 [<c014c091>] note_interrupt+0x1b8/0x1f3
 [<c014b5d6>] handle_IRQ_event+0x1a/0x3f
 [<c014c613>] handle_fasteoi_irq+0x8a/0xab
 [<c01063ff>] do_IRQ+0x57/0x70
 [<c0104773>] common_interrupt+0x23/0x28
 [<c01021a6>] mwait_idle_with_hints+0x3b/0x3f
 [<c01021aa>] mwait_idle+0x0/0xa
 [<c0102389>] cpu_idle+0x96/0xcb
 [<c035b93c>] start_kernel+0x318/0x320
 [<c035b17b>] unknown_bootoption+0x0/0x202
 =======================
handlers:
[<f88a7602>] (ide_intr+0x0/0x1c1 [ide_core])
[<f8a922b4>] (snd_emu10k1_interrupt+0x0/0x3cc [snd_emu10k1])
Disabling IRQ #23
hdb: lost interrupt
hdb: lost interrupt
hdb: lost interrupt
hdb: lost interrupt
hdb: lost interrupt
ide-cd: cmd 0x1e timed out
hdb: lost interrupt
hdb: lost interrupt
hdb: lost interrupt
hdb: lost interrupt
hdb: lost interrupt
hdb: lost interrupt
ide-cd: cmd 0x1e timed out
hdb: lost interrupt


From /proc/interrupts:
 23:     500046          0   IO-APIC-fasteoi   ide0, EMU10K1
Comment 21 Natalie Protasevich 2007-08-07 01:35:15 UTC
I think something is wrong with ide0 using a level-triggered IRQ line and sharing it with another device; IDE is usually edge-triggered and therefore not shareable. Can you please post your dmesg?
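
A minimal sketch for listing which IRQ lines in a saved /proc/interrupts are shared, together with the controller column that names the interrupt type (for example IO-APIC-fasteoi on the line quoted in comment 20). The function name is illustrative, and the parser assumes the usual layout of a CPU header row followed by per-CPU counters, a controller name, and a comma-separated device list.

```python
def shared_irqs(text):
    """Return {irq: (controller, [devices])} for rows naming more than one device."""
    lines = text.splitlines()
    ncpus = len(lines[0].split()) if lines else 0   # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(':')
        # Split off the per-CPU counters and the controller; keep the
        # device list (which may contain spaces) as one trailing field.
        fields = rest.split(None, ncpus + 1)
        if not sep or len(fields) < ncpus + 2:
            continue
        controller = fields[ncpus]
        devices = [d.strip() for d in fields[ncpus + 1].split(',')]
        if len(devices) > 1:
            shared[label.strip()] = (controller, devices)
    return shared
```

Run on the snapshot from comment 20, this would report IRQ 23 with ide0 and EMU10K1 on the same line, which is the sharing being questioned here.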
Comment 22 H 2007-08-07 01:45:57 UTC
Created attachment 12291
Bootup dmesg output
Comment 23 H 2007-08-07 01:49:14 UTC
Created attachment 12292
Bootup dmesg output

Sorry, I thought attaching a URL would produce a result that made sense, but that turned out to be a bad idea.
Comment 24 Alan 2009-07-20 14:14:41 UTC
No activity since 2007: Closing
