Bug 10216

Summary: kernel from different versions hangs with some CD/DVD drives
Product: IO/Storage
Reporter: lars.winterfeld
Component: IDE
Assignee: io_ide (io_ide)
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Severity: normal
CC: headless
Priority: P1
Hardware: All
OS: Linux
Kernel Version: all tried (2.4.* and 2.6.*)
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: dmesg from 2.6.25-gentoo-r6 kernel
config diff between 2.6.24-gentoo-r8 and 2.6.25-gentoo-r6
working .config for 2.6.24-gentoo-r8
config for 2.6.25-gentoo-r9 for both patched and unpatched
config for 2.6.26-gentoo-r3 for both patched and unpatched
config for 2.6.27-gentoo-r4 for both patched and unpatched
dmesg from unpatched kernel 2.6.24-gentoo-r8
dmesg from unpatched kernel 2.6.25-gentoo-r9
dmesg from patched kernel 2.6.25-gentoo-r9
dmesg from unpatched kernel 2.6.26-gentoo-r3
dmesg from patched kernel 2.6.26-gentoo-r3
dmesg from unpatched kernel 2.6.27-gentoo-r4
dmesg from patched kernel 2.6.27-gentoo-r4
dmesg from unpatched original kernel 2.6.25.9
dmesg from patched original kernel 2.6.25.9

Description lars.winterfeld 2008-03-10 10:38:05 UTC
Latest working kernel version: unknown
Earliest failing kernel version: 2.4
Distribution: all tried
Hardware Environment: notebook "Fujitsu Siemens Amilo Xa-2528", output from `lspci -vvv` attached
Software Environment: n/a
Problem Description:
I want to report a problem with certain CD/DVD drives and controllers which, as far as I can see, is kernel-related. I ran into it on my new notebook (Fujitsu Siemens Amilo Xa-2528). I tried the installation CDs of different distributions (Gentoo, Ubuntu, DSL, ...) with different kernel versions, including some of the 2.4 and 2.6 series, and all of them failed to boot. Here is part of the log:

Mar  7 20:51:59 (none) Freeing unused kernel memory: 376k freed
Mar  7 20:51:59 (none) ide-cd: cmd 0x5a timed out
Mar  7 20:51:59 (none) hda: irq timeout: status=0x58 { DriveReady SeekComplete DataRequest }
Mar  7 20:51:59 (none) ide: failed opcode was: unknown
Mar  7 20:51:59 (none) hda: ATAPI CD-ROM drive, 0kB Cache
Mar  7 20:51:59 (none) Uniform CD-ROM driver Revision: 3.20
Mar  7 20:51:59 (none) hda: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Mar  7 20:51:59 (none) ide: failed opcode was: unknown
Mar  7 20:51:59 (none) hda: drive not ready for command
Mar  7 20:51:59 (none) hda: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Mar  7 20:51:59 (none) ide: failed opcode was: unknown
Mar  7 20:51:59 (none) hda: drive not ready for command

The last error messages are printed repeatedly until the computer hangs. By googling the error message I found that the problem is not limited to my computer ( http://ubuntuforums.org/showthread.php?t=699244 , https://bugs.launchpad.net/ubuntu/+source/linux/+bug/182996 , http://forums.gentoo.org/viewtopic-t-666680.html?sid=36e4b5379601abe70a5f2fc5420e1765 , ...).
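For readers of the log above, here is a minimal sketch of how the value status=0x58 maps to the three names printed next to it. The bit positions are the standard ATA status register bits; the helper name decode_ata_status() and the userspace framing are illustrative assumptions, not the kernel's own code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Write the space-separated names of the status bits set in `stat` into `buf`.
 * Hypothetical helper for illustration only. */
static void decode_ata_status(uint8_t stat, char *buf, size_t len)
{
    static const struct {
        uint8_t bit;
        const char *name;
    } bits[] = {
        { 0x80, "Busy" },           /* BSY  */
        { 0x40, "DriveReady" },     /* DRDY */
        { 0x20, "DeviceFault" },    /* DF   */
        { 0x10, "SeekComplete" },   /* DSC  */
        { 0x08, "DataRequest" },    /* DRQ  */
        { 0x04, "CorrectedError" }, /* CORR */
        { 0x02, "Index" },          /* IDX  */
        { 0x01, "Error" },          /* ERR  */
    };
    size_t i;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (i = 0; i < sizeof(bits) / sizeof(bits[0]); i++) {
        if (!(stat & bits[i].bit))
            continue;
        /* Separate names with a single space, never overflowing buf. */
        if (buf[0] != '\0')
            strncat(buf, " ", len - strlen(buf) - 1);
        strncat(buf, bits[i].name, len - strlen(buf) - 1);
    }
}
```

Decoding 0x58 this way gives DriveReady, SeekComplete and DataRequest: the busy bit is clear and the drive is asking to transfer data, which matches the braces in the log lines.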

As far as I can tell, the problem appears as soon as the CD/DVD drive is accessed, which is of course done by almost any Linux installation or "live" CD. So I tried a Gentoo-based boot CD called "unattended gui" which does not mount the CD filesystem (btw: kernel config http://unattended-gui.sourceforge.net/wiki/index.php?title=Bootcd:kernel )

Once booted, I ran lspci telling me:
00:0d.0 IDE interface: nVidia Corporation MCP51 IDE (rev f1)
00:0e.0 IDE interface: nVidia Corporation MCP51 Serial ATA Controller (rev f1)

The drive inside my notebook is recognized as "hda: Optiarc DVD RW AD-7540A, ATAPI CD/DVD-ROM drive", but Google tells me the same (or a similar) problem also appears on "PHILIPS DVDR1660P1", "BENQ DVD DD DW1620", "LILITE-ON DVD SOHD-16P6P9S", "RICOH CD-R/RW MP7040A" and others.

The pre-installed Windows Vista can read and write DVDs.

Some log (from the "unattended"...):
http://www.winterfeld.de/Lars/amalio/lspci-vvv
http://www.winterfeld.de/Lars/amalio/messages
http://www.winterfeld.de/Lars/amalio/dmesg
produced on this system:
http://www.winterfeld.de/Lars/amalio/cpuinfo
http://www.winterfeld.de/Lars/amalio/iomem
http://www.winterfeld.de/Lars/amalio/ioports
http://www.winterfeld.de/Lars/amalio/modules
http://www.winterfeld.de/Lars/amalio/scsi
http://www.winterfeld.de/Lars/amalio/uname-a


Since I also program in C, I took a look at some ide*.c files where I found the error message, but I have no clue about how the kernel works. So, I would be really grateful for any suggestions and if more information is needed, then please just ask.

Lars

Steps to reproduce: boot problem...
Comment 1 Bartlomiej Zolnierkiewicz 2008-06-18 16:00:38 UTC
Could you try some newer kernel (2.6.25 or 2.6.26-rc6)? There have been a lot of fixes for ATA/ATAPI support since 2.6.24.
Comment 2 lars.winterfeld 2008-07-08 02:56:25 UTC
Created attachment 16766 [details]
dmesg from 2.6.25-gentoo-r6 kernel
Comment 3 lars.winterfeld 2008-07-08 03:01:19 UTC
For 2.6.24 I found a config with which the error does not appear (see bottom of http://gentoo-wiki.com/HARDWARE_Fujitsu-Siemens_Amilo_Xa2528 ). I could even read from the drive. Now I'm using 2.6.25-gentoo-r6 and the error appears again (dmesg attached above). The device is no longer present in /dev, but the notebook boots (with a delay of 6 minutes or so).
I'm still looking at the diff of the kernel configs and couldn't yet find anything which could have caused the problem. (So it could be the fault of the "lot of fixes" ;-) ) If someone could tell me some options which could be related to the problem, I would try the combinations of them and post the log files.
Comment 4 lars.winterfeld 2008-07-08 03:03:17 UTC
Created attachment 16767 [details]
config diff between 2.6.24-gentoo-r8 and 2.6.25-gentoo-r6
Comment 5 Bartlomiej Zolnierkiewicz 2008-11-09 07:53:26 UTC
This actually seems to be a controller specific issue (quite surprising at that).
Could you try the fix proposed for bug #11659 (in hindsight it seems to be the same issue):

http://bugzilla.kernel.org/attachment.cgi?id=18748&action=view
Comment 6 lars.winterfeld 2008-11-10 15:15:04 UTC
wanted to try it with 2.6.25.9, but include/linux/ide.h already contains a (1 << 26) flag: IDE_HFLAG_ABUSE_SET_DMA_MODE. all others up to 1<<31 are also used... because i do not want to break something else, i would try to store the "use-workaround"-flag somewhere else. what would you advise?
Comment 7 Bartlomiej Zolnierkiewicz 2008-11-10 15:40:02 UTC
Hmm, the safest bet would be to define the IDE_HFLAG_BROKEN_ALTSTATUS flag as some flag that is set unconditionally in the amd74xx host driver -- i.e. adding:

        IDE_HFLAG_BROKEN_ALTSTATUS = IDE_HFLAG_PIO_NO_BLACKLIST,

at the end of the enum { } clause.
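A minimal userspace sketch of the aliasing trick suggested here, for anyone backporting to a tree where all 32 flag bits are taken. The HFLAG_* names mirror the ones in this thread, but the bit positions and the helper host_needs_altstatus_workaround() are assumptions made for illustration and are not copied from include/linux/ide.h.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative host flags; the bit positions are assumed, not the kernel's. */
enum {
    HFLAG_PIO_NO_BLACKLIST   = (1 << 2),
    HFLAG_ABUSE_SET_DMA_MODE = (1 << 26),
    /* No free bit left, so reuse one that the affected host driver
     * always sets. Both names now test the same bit. */
    HFLAG_BROKEN_ALTSTATUS   = HFLAG_PIO_NO_BLACKLIST
};

/* Hypothetical check a polling routine could make before trusting AltStatus. */
static int host_needs_altstatus_workaround(uint32_t host_flags)
{
    return (host_flags & HFLAG_BROKEN_ALTSTATUS) != 0;
}

/* Usage sketch: the alias fires for any host that sets the reused bit. */
static void alias_demo(void)
{
    assert(host_needs_altstatus_workaround(HFLAG_PIO_NO_BLACKLIST));
    assert(!host_needs_altstatus_workaround(HFLAG_ABUSE_SET_DMA_MODE));
}
```

The trade-off is that every host driver setting the reused flag gets the workaround as well, so this looks suitable for testing a backport on one machine rather than as a general fix.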
Comment 8 Bartlomiej Zolnierkiewicz 2008-11-13 13:20:20 UTC
Any news on this?
Comment 9 lars.winterfeld 2008-11-13 15:00:40 UTC
i'll try it tomorrow. i'm going to post dmesg from the kernel with and without the patch. otherwise i'll do it on sunday, latest.
Comment 10 lars.winterfeld 2008-11-23 04:12:30 UTC
you are my hero! ;-)
it worked. i tried it with the original source of linux-2.6.25.9
without the patch the described problem appears, with the patch (and nothing else changed) the kernel boots just fine, /dev/cdrom gets created and i can read files from it.
i'm now going to attach some files (dmesg, maybe from other kernel versions, etc.) just for reference...
i think afterwards you can mark this bug as closed or resolved...
would like to see this in the next kernel version... :-)
Comment 11 lars.winterfeld 2008-11-23 13:59:46 UTC
Created attachment 18979 [details]
working .config for 2.6.24-gentoo-r8
Comment 12 lars.winterfeld 2008-11-23 14:00:41 UTC
Created attachment 18980 [details]
config for 2.6.25-gentoo-r9 for both patched and unpatched
Comment 13 lars.winterfeld 2008-11-23 14:01:48 UTC
Created attachment 18981 [details]
config for 2.6.26-gentoo-r3 for both patched and unpatched
Comment 14 lars.winterfeld 2008-11-23 14:02:18 UTC
Created attachment 18982 [details]
config for 2.6.27-gentoo-r4 for both patched and unpatched
Comment 15 lars.winterfeld 2008-11-23 14:03:01 UTC
Created attachment 18983 [details]
dmesg from unpatched kernel 2.6.24-gentoo-r8
Comment 16 lars.winterfeld 2008-11-23 14:03:32 UTC
Created attachment 18984 [details]
dmesg from unpatched kernel 2.6.25-gentoo-r9
Comment 17 lars.winterfeld 2008-11-23 14:04:46 UTC
Created attachment 18985 [details]
dmesg from patched kernel 2.6.25-gentoo-r9
Comment 18 lars.winterfeld 2008-11-23 14:06:11 UTC
Created attachment 18986 [details]
dmesg from unpatched kernel 2.6.26-gentoo-r3
Comment 19 lars.winterfeld 2008-11-23 14:06:55 UTC
Created attachment 18987 [details]
dmesg from patched kernel 2.6.26-gentoo-r3
Comment 20 Sergei Shtylyov 2008-11-23 14:44:51 UTC
Note the "Clocksource tsc unstable" message from the patched kernel, presumably while probing ide0.
Comment 21 lars.winterfeld 2008-11-23 14:52:55 UTC
Created attachment 18988 [details]
dmesg from unpatched kernel 2.6.27-gentoo-r4
Comment 22 lars.winterfeld 2008-11-23 14:53:29 UTC
Created attachment 18989 [details]
dmesg from patched kernel 2.6.27-gentoo-r4
Comment 23 lars.winterfeld 2008-11-23 14:55:19 UTC
Created attachment 18990 [details]
dmesg from unpatched original kernel 2.6.25.9
Comment 24 lars.winterfeld 2008-11-23 14:55:47 UTC
Created attachment 18991 [details]
dmesg from patched original kernel 2.6.25.9
Comment 25 lars.winterfeld 2008-11-23 15:18:18 UTC
by looking through the dmesg output, you can see that the patch http://bugzilla.kernel.org/attachment.cgi?id=18748&action=view did it for me in all tested versions from 2.6.25 to 2.6.27.
i patched the original kernel ("vanilla-sources" as it's called in gentoo) and different sources which are patched by gentoo-developers. (main drivers part should be the same...)
i patched drivers/ide/ide-iops.c, drivers/ide/ide-probe.c, drivers/ide/pci/amd74xx.c as described and include/linux/ide.h like this:
        IDE_HFLAG_NO_UNMASK_IRQS        = (1 << 31),
+       /* AltStatus register can be unreliable*/
+       IDE_HFLAG_BROKEN_ALTSTATUS = IDE_HFLAG_PIO_NO_BLACKLIST,
  };


what is that "Clocksource tsc unstable" about?
Comment 26 Sergei Shtylyov 2008-11-23 15:24:32 UTC
(In reply to comment #25)
 
> what is that "Clocksource tsc unstable" about?

Presumably a very long delay with interrupts blocked.
Comment 27 Sergei Shtylyov 2008-11-23 15:30:05 UTC
Bart, looks like ide_wait_not_busy() needs to call clocksource_touch_watchdog().
Comment 28 Remy LABENE 2008-11-24 00:37:48 UTC
Indeed, it seems that it's the same problem as here 

http://bugzilla.kernel.org/show_bug.cgi?id=11659
Comment 29 Bartlomiej Zolnierkiewicz 2008-12-02 11:19:21 UTC
Lars: thanks for testing, I updated patch description accordingly.

Sergei: seems so, could you make a patch for it?
Comment 30 Sergei Shtylyov 2008-12-02 12:23:24 UTC
(In reply to comment #29)
> Lars: thanks for testing, I updated patch description accordingly.

> Sergei: seems so, could you make a patch for it?

I have been overloaded for months already...
Comment 31 Bartlomiej Zolnierkiewicz 2008-12-07 13:08:06 UTC
Sergei: I looked into it (under QEMU for now) and adding msleep(1000) + printk() to ide_init() causes the "Clocksource tsc unstable" to appear before IDE initialization so it seems that the issue is caused by some other kernel code and IDE is just unlucky to be initialized at same time as clocksource watchdog triggers.
Comment 32 Sergei Shtylyov 2008-12-07 14:38:54 UTC
(In reply to comment #31)
> Sergei: I looked into it (under QEMU for now) and adding msleep(1000) +
> printk() to ide_init() causes the "Clocksource tsc unstable" to appear before
> IDE initialization so it seems that the issue is caused by some other kernel
> code and IDE is just unlucky to be initialized at same time as clocksource
> watchdog triggers.

I guess it's caused by the msleep(1000) call itself. If ide_wait_not_busy() is touching the other watchdogs anyway, I don't see why it shouldn't touch the clocksource watchdog too.
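To make the suggestion concrete, here is a minimal userspace sketch of a bounded busy-poll loop that touches its watchdogs on every iteration. It is only a model: struct poll_ops, wait_not_busy() and the fake device are hypothetical names, and in the kernel the touch hook is where a call such as clocksource_touch_watchdog() would sit next to the watchdog touches the loop already makes.

```c
#include <assert.h>
#include <stdint.h>

#define STAT_BUSY 0x80 /* ATA BSY bit */

/* Hypothetical stand-ins for the hardware read, the 1 ms delay and the
 * watchdog-touch calls; none of these are real kernel interfaces. */
struct poll_ops {
    uint8_t (*read_status)(void *ctx);
    void (*delay_1ms)(void *ctx);
    void (*touch_watchdogs)(void *ctx);
};

/* Poll until BSY clears or timeout_ms iterations have elapsed.
 * Returns 0 when the device is no longer busy, -1 on timeout. */
static int wait_not_busy(const struct poll_ops *ops, void *ctx,
                         unsigned long timeout_ms)
{
    assert(ops && ops->read_status && ops->delay_1ms && ops->touch_watchdogs);
    while (timeout_ms--) {
        ops->delay_1ms(ctx);
        if ((ops->read_status(ctx) & STAT_BUSY) == 0)
            return 0;
        /* Still busy: tell the watchdogs that the long wait is intentional. */
        ops->touch_watchdogs(ctx);
    }
    return -1;
}

/* Fake device for demonstration: reports busy for a fixed number of polls. */
struct fake_dev {
    unsigned busy_polls;
    unsigned touches;
};

static uint8_t fake_read_status(void *ctx)
{
    struct fake_dev *d = ctx;

    if (d->busy_polls > 0) {
        d->busy_polls--;
        return STAT_BUSY;
    }
    return 0x50; /* DriveReady | SeekComplete */
}

static void fake_delay(void *ctx) { (void)ctx; }

static void fake_touch(void *ctx)
{
    struct fake_dev *d = ctx;

    d->touches++;
}
```

With the fake device busy for three polls, the loop returns success on the fourth read after touching the watchdogs three times; a device that never clears BSY hits the timeout with one touch per iteration.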
Comment 33 Sergei Shtylyov 2008-12-07 15:23:08 UTC
(In reply to comment #32)
> (In reply to comment #31)
> > Sergei: I looked into it (under QEMU for now) and adding msleep(1000) +
> > printk() to ide_init() causes the "Clocksource tsc unstable" to appear before
> > IDE initialization so it seems that the issue is caused by some other kernel
> > code and IDE is just unlucky to be initialized at same time as clocksource
> > watchdog triggers.

> I guess it's caused by the msleep(1000) call itself.

Looks like it wasn't a wild guess. The clocksource watchdog timer handler gets run every half second. :-)