Bug 11426 - reproducible panic when reading cdrom (atapi dma)
Summary: reproducible panic when reading cdrom (atapi dma)
Status: CLOSED CODE_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: All Linux
Importance: P1 high
Assignee: Jeff Garzik
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-08-25 12:24 UTC by Mathis Ahrens
Modified: 2011-08-26 15:19 UTC
CC List: 10 users

See Also:
Kernel Version: 2.6.27-rc5
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
config (87.62 KB, text/plain)
2008-08-25 12:27 UTC, Mathis Ahrens
dmesg (22.71 KB, text/plain)
2008-08-25 12:28 UTC, Mathis Ahrens
config-2.6.27-rc4 (UP) (33.65 KB, text/plain)
2008-08-28 04:56 UTC, Mathis Ahrens
dmesg-2.6.27-rc4 (UP) (15.61 KB, text/plain)
2008-08-28 04:57 UTC, Mathis Ahrens
panic-2.6.27-rc4 (UP) (1.34 KB, text/plain)
2008-08-28 04:57 UTC, Mathis Ahrens
large-dma-pad.patch (396 bytes, patch)
2011-07-27 12:37 UTC, Tejun Heo
dmidecode (6.82 KB, text/plain)
2011-07-28 15:03 UTC, Jim Bray
lspci -nnvvv (10.44 KB, text/plain)
2011-07-28 15:04 UTC, Jim Bray
dmesg pata kernel with ata_dma_pad_sz=1024 (22.85 KB, text/plain)
2011-07-28 15:07 UTC, Jim Bray
via-no-atapi-dma-on-averatec3200.patch (842 bytes, patch)
2011-07-30 12:33 UTC, Tejun Heo
screen-shot at freeze (177.24 KB, image/jpeg)
2011-07-30 17:53 UTC, Jim Bray
via-no-atapi-dma-on-averatec3200-1.patch (840 bytes, patch)
2011-07-30 18:00 UTC, Tejun Heo

Description Mathis Ahrens 2008-08-25 12:24:45 UTC
Latest working kernel version: 2.6.22
Earliest failing kernel version: 2.6.24
Distribution: Ubuntu 8.04 with vanilla kernel
Hardware Environment: Fujitsu/Siemens Amilo Laptop
Problem Description: 
When reading a few hundred MB from cdrom using DMA, kernel panics eventually.

There are two setups WITHOUT this bug:
 - older kernels (where the cdrom gets mounted as hdc)
 - current kernels with libata.dma=1 (PIO4 mode ATAPI)

Steps to reproduce:
$ cp /media/cdromName/1-gb-file /home/onDisk 

After a few hundred MB, hard lockup: no LEDs, no SysRq, no logs.

netconsole gave me the following output:

[  525.388025] divide error: 0000 [#1] SMP
[  525.388025] Modules linked in: netconsole configfs isofs binfmt_misc radeon drm rfcomm l2cap bluetooth ppdev powernow_k7 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand freq_table cpufreq_conservative video output sbs sbshc container iptable_filter ip_tables x_tables sbp2 lp ipv6 joydev tulip pcmcia serio_raw psmouse parport_pc parport pcspkr pwc videodev v4l1_compat i2c_viapro i2c_core via_ircc battery irda ac crc_ccitt yenta_socket button via_agp agpgart rsrc_nonstatic pcmcia_core shpchp pci_hotplug evdev ext3 jbd mbcache sg sr_mod cdrom sd_mod pata_via pata_acpi ata_generic libata via_rhine mii ehci_hcd uhci_hcd scsi_mod dock ohci1394 usbcore ieee1394 thermal processor fan thermal_sys fuse
[  525.388025]
[  525.388025] Pid: 6084, comm: trackerd Not tainted (2.6.26.3 #2)
[  525.388025] EIP: 0060:[<c0103d3c>] EFLAGS: 00200202 CPU: 0
[  525.388025] EIP is at sysenter_past_esp+0x39/0xb1
[  525.388025] EAX: 0000008c EBX: 0000001e ECX: 00000000 EDX: 000000d8
[  525.388025] ESI: b7959bb0 EDI: 00000000 EBP: b7959b90 ESP: d9391fb8
[  525.388025]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[  525.388025] Process trackerd (pid: 6084, ti=d9390000 task=d9fba3f0 task.ti=d9390000)
[  525.388025] Stack: 0000001e 00000000 000f0000 b7959bb0 00000000 b7959b90 0000008c 0000007b
[  525.388025]        0000007b 00000000 0000008c b7f1b424 00000073 00200292 b7959b90 0000007b
[  525.388025]        00008067 00007067
[  525.388025] Call Trace:
[  525.388025]  =======================
[  525.388025] Code: b4 24 34 e0 ff ff 50 fc 0f a0 06 1e 50 55 57 56 52 51 53 ba 7b 00 00 00 8e da 8e c2 ba d8 00 00 00 8e e2 fb 8d 04 05 00 00 00 00 <90> 8d 80 00 00 00 00 81 fd fd ff ff bf 0f 83 fa 01 00 00 8b 6d
[  525.388025] EIP: [<c0103d3c>] sysenter_past_esp+0x39/0xb1 SS:ESP 0068:d9391fb8
[  525.388025] ---[ end trace 5028e900103372a1 ]---
[  525.399790] divide error: 0000 [#2] SMP
[  525.399790] Modules linked in: netconsole configfs isofs binfmt_misc radeon drm rfcomm l2cap bluetooth ppdev powernow_k7 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand freq_table cpufreq_conservative video output sbs sbshc container iptable_filter ip_tables x_tables sbp2 lp ipv6 joydev tulip pcmcia serio_raw psmouse parport_pc parport pcspkr pwc videodev v4l1_compat i2c_viapro i2c_core via_ircc battery irda ac crc_ccitt yenta_socket button via_agp agpgart rsrc_nonstatic pcmcia_core shpchp pci_hotplug evdev ext3 jbd mbcache sg sr_mod cdrom sd_mod pata_via pata_acpi ata_generic libata via_rhine mii ehci_hcd uhci_hcd scsi_mod dock ohci1394 usbcore ieee1394 thermal processor fan thermal_sys fuse
[  525.399790] 
[  525.399790] Pid: 0, comm: swapper Tainted: G      D   (2.6.26.3 #2)
[  525.399790] EIP: 0060:[<e081d2ae>] EFLAGS: 00000292 CPU: 0
[  525.399790] EIP is at acpi_idle_enter_simple+0x16f/0x1dd [processor]
[  525.399790] EAX: c0479170 EBX: d5650000 ECX: 00006767 EDX: 00f8f000
[  525.399790] ESI: d5650450 EDI: 001c3383 EBP: 001b8f17 ESP: c041ffac
[  525.399790]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[  525.399790] Process swapper (pid: 0, ti=c041e000 task=c03f0340 task.ti=c041e000)
[  525.399790] Stack: 0000a46c 00000000 d56500cc d56500dc d565001c 00922007 c0288c9a c0288c10
[  525.399790]        00000000 c047c480 c0102753 00000000 00099800 c040f000 c04239ef 00000035
[  525.399790]        c0423130 00000000 c044e900 00000000 00000000
[  525.399790] Call Trace:
[  525.399790]  [<c0288c9a>] cpuidle_idle_call+0x8a/0xd0
[  525.399790]  [<c0288c10>] cpuidle_idle_call+0x0/0xd0
[  525.399790]  [<c0102753>] cpu_idle+0x63/0xe0
[  525.399790]  =======================
[  525.399790] Code: 8d 04 07 99 89 04 24 b8 17 01 00 00 89 54 24 04 f7 24 24 69 4c 24 04 17 01 00 00 8d 14 11 e8 7a 15 92 df fb 8d 04 05 00 00 00 00 <90> 90 89 e0 31 c9 25 00 e0 ff ff 89 f2 83 48 0c 02 89 d8 ff 46
[  525.399790] EIP: [<e081d2ae>] acpi_idle_enter_simple+0x16f/0x1dd [processor] SS:ESP 0068:c041ffac
[  525.399790] ---[ end trace 5028e900103372a1 ]---
[  525.399790] Kernel panic - not syncing: Attempted to kill the idle task!


$ lspci
00:00.0 Host bridge: VIA Technologies, Inc. VT8377 [KT400/KT600 AGP] Host Bridge (rev 80)
00:01.0 PCI bridge: VIA Technologies, Inc. VT8237 PCI Bridge
00:09.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host Controller (rev 80)
00:0b.0 CardBus bridge: O2 Micro, Inc. OZ601/6912/711E0 CardBus/SmartCardBus Controller (rev 20)
00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80)
00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80)
00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80)
00:10.3 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 82)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8235 ISA Bridge
00:11.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00:11.5 Multimedia audio controller: VIA Technologies, Inc. VT8233/A/8235/8237 AC97 Audio Controller (rev 50)
00:11.6 Communication controller: VIA Technologies, Inc. AC'97 Modem Controller (rev 80)
00:12.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 74)
01:00.0 VGA compatible controller: ATI Technologies Inc Radeon RV250 [Mobility FireGL 9000] (rev 01)
02:00.0 Ethernet controller: ADMtek 21x4x DEC-Tulip compatible 10/100 Ethernet (rev 11)
Comment 1 Mathis Ahrens 2008-08-25 12:27:23 UTC
Created attachment 17441
config
Comment 2 Mathis Ahrens 2008-08-25 12:28:03 UTC
Created attachment 17442
dmesg
Comment 3 Andrew Morton 2008-08-25 12:48:34 UTC
A divide-by-zero in acpi_idle_enter_simple()

Marked as regression, reassigned to acpi people.
Comment 4 Venkatesh Pallipadi 2008-08-25 14:13:30 UTC
This does not seem to be an ACPI problem. It looks like some memory corruption is happening when DMA is used to access the CD-ROM.

For the divide error in acpi_idle_enter_simple(), the trace shows that the faulting instruction is a nop, followed by another nop and a mov, none of which can raise a divide error. acpi_idle_enter_simple() has one divide in its code, but it is a divide by a constant value, so that should never cause a divide error either.
Also, the acpi_idle_enter_simple() error is the #2 error. The #1 divide error seems to be at sysenter_past_esp(), which, again according to the trace, is a nop followed by an lea instruction. Those instructions should not cause a divide error either.

Most likely, the actual code fetched from I-cache was different and was corrupted, and code sequence we print in the above trace is coming from L2/D-cache and probably different from what was tried to be executed.

I think this should get attention of some SATA/ATAPI folks.
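
The opcode check described above can be repeated offline. The following is a minimal sketch, not part of the original comment: it takes excerpts of the two "Code:" lines from the description and prints the bytes starting at the `<..>` marker, which is the byte at the reported EIP. The helper name `bytes_at_eip` is made up for this illustration.

```python
import re

# Excerpts of the two "Code:" lines from the oopses in the description;
# the byte wrapped in <..> is the one at the reported EIP.
OOPS1_CODE = "8e e2 fb 8d 04 05 00 00 00 00 <90> 8d 80 00 00 00 00 81 fd"
OOPS2_CODE = "df fb 8d 04 05 00 00 00 00 <90> 90 89 e0 31 c9 25 00 e0"

X86_NOP = 0x90  # single-byte x86 nop


def bytes_at_eip(code_line, count):
    """Return `count` byte values starting at the <xx> marker (hypothetical helper)."""
    tokens = code_line.split()
    for i, tok in enumerate(tokens):
        m = re.fullmatch(r"<([0-9a-f]{2})>", tok)
        if m:
            window = [m.group(1)] + tokens[i + 1:i + count]
            return [int(b, 16) for b in window]
    raise ValueError("no <xx> marker in Code: line")


for name, line in (("#1", OOPS1_CODE), ("#2", OOPS2_CODE)):
    window = bytes_at_eip(line, 3)
    print(name, [f"{b:02x}" for b in window],
          "starts with nop:", window[0] == X86_NOP)
```

In both oopses the marked byte is 0x90, which matches the observation that the reported faulting instruction is a nop.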
Comment 5 Mathis Ahrens 2008-08-25 14:36:00 UTC
I think so, too:
  acpi=off AND libata.dma=1   --> still no bug
  acpi=off ONLY               --> still panics:

[  698.590308] divide error: 0000 [#1] SMP 
[  698.590308] Modules linked in: isofs binfmt_misc radeon drm rfcomm l2cap bluetooth apm ppdev cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand freq_table cpufreq_conservative netconsole configfs iptable_filter ip_tables x_tables ipv6 sbp2 lp joydev i2c_viapro analog i2c_core tulip parport_pc gameport via_ircc parport pcmcia psmouse serio_raw irda pcspkr crc_ccitt yenta_socket rsrc_nonstatic pcmcia_core shpchp pci_hotplug via_agp agpgart evdev ext3 jbd mbcache sg sr_mod cdrom sd_mod pata_via pata_acpi ata_generic via_rhine mii libata scsi_mod dock uhci_hcd ehci_hcd usbcore ohci1394 ieee1394 thermal_sys fuse
[  698.590308] 
[  698.590308] Pid: 10574, comm: cp Not tainted (2.6.26.3 #2)
[  698.590308] EIP: 0060:[<c0319a86>] EFLAGS: 00000282 CPU: 0
[  698.590308] EIP is at _spin_unlock_irqrestore+0x6/0x10
[  698.590308] EAX: 00000282 EBX: 00000001 ECX: e08e6aa0 EDX: 00000282
[  698.590308] ESI: dfb2154c EDI: 00000002 EBP: 00000000 ESP: c083dca4
[  698.590308]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[  698.590308] Process cp (pid: 10574, ti=c083c000 task=d97913b0 task.ti=c083c000)
[  698.590308] Stack: e0a08715 00000000 00000000 c01b0708 df93c000 00000001 00000282 df93c074 
[  698.590308]        00000021 2150eca0 df80d0a0 00000000 00000000 0000000e c0162140 c0412580 
[  698.590308]        0000000e 00000000 c04125b0 c0163a37 00000700 c0412580 00000000 0000000e 
[  698.590308] Call Trace:
[  698.590308]  [<e0a08715>] ata_sff_interrupt+0xe5/0x1f0 [libata]
[  698.590308]  [<c01b0708>] __block_prepare_write+0x218/0x3c0
[  698.590308]  [<c0162140>] handle_IRQ_event+0x30/0x60
[  698.590308]  [<c0163a37>] handle_level_irq+0x77/0xe0
[  698.590308]  [<c01063bb>] do_IRQ+0x3b/0x70
[  698.590308]  [<c01047cf>] common_interrupt+0x23/0x28
[  698.590308]  [<c0215446>] __copy_from_user_ll_nocache_nozero+0x36/0x100
[  698.590308]  [<c01673a8>] iov_iter_copy_from_user_atomic+0x48/0x80
[  698.590308]  [<c016896e>] generic_file_buffered_write+0x13e/0x650
[  698.590308]  [<c01e366c>] security_inode_need_killpriv+0xc/0x10
[  698.590308]  [<c0167445>] remove_suid+0x15/0x60
[  698.590308]  [<c01a4aed>] mnt_drop_write+0x1d/0xe0
[  698.590308]  [<c016910d>] __generic_file_aio_write_nolock+0x28d/0x4f0
[  698.590308]  [<c01a4f01>] mnt_want_write+0x21/0x60
[  698.590308]  [<c0169d51>] generic_file_aio_read+0x591/0x600
[  698.590308]  [<c01693d5>] generic_file_aio_write+0x65/0xe0
[  698.590308]  [<e08eb6a0>] ext3_file_write+0x30/0xc0 [ext3]
[  698.590308]  [<c018d135>] do_sync_write+0xd5/0x120
[  698.590308]  [<c0318938>] mutex_lock+0x8/0x20
[  698.590308]  [<c0139800>] autoremove_wake_function+0x0/0x40
[  698.590308]  [<c01e369c>] security_file_permission+0xc/0x10
[  698.590308]  [<c018d2fe>] rw_verify_area+0x5e/0xd0
[  698.590309]  [<c018d060>] do_sync_write+0x0/0x120
[  698.590309]  [<c018da1f>] vfs_write+0x9f/0x160
[  698.590309]  [<c018e141>] sys_write+0x41/0x70
[  698.590309]  [<c0103d7b>] sysenter_past_esp+0x78/0xb1
[  698.590309]  =======================
[  698.590309] Code: c2 31 c0 5b 85 d2 0f 95 c0 c3 8d 74 26 00 8d bc 27 00 00 00 00 fe 00 c3 8d b6 00 00 00 00 8d bc 27 00 00 00 00 fe 00 89 d0 50 9d <8d> 04 05 00 00 00 00 90 c3 90 fe 00 fb 8d 04 05 00 00 00 00 90 
[  698.590309] EIP: [<c0319a86>] _spin_unlock_irqrestore+0x6/0x10 SS:ESP 0068:c083dca4
[  698.590309] Kernel panic - not syncing: Fatal exception in interrupt
Comment 6 Mathis Ahrens 2008-08-28 04:54:27 UTC
I tested 2.6.27-rc4 with a completely stripped-down UP config (no ACPI, some random debugging enabled) and the bug is still there:

 pio mode (libata.dma=1)  --> no bug
 default (dma=7)          --> panic

I will attach the new config, dmesg and the panic output.

Any hints on what to do next are very welcome!
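
For readers unfamiliar with the `libata.dma` values used above, here is a small sketch of how the bitmask decodes. It assumes the bit meanings given in the libata module parameter description (0x1 == ATA, 0x2 == ATAPI, 0x4 == CF); the function name is hypothetical and this is not kernel code.

```python
# Bit meanings assumed from the libata "dma" module parameter description:
# 0x1 == ATA disks, 0x2 == ATAPI (CD/DVD), 0x4 == CompactFlash.
DMA_BITS = {0x1: "ATA", 0x2: "ATAPI", 0x4: "CF"}


def dma_enabled_classes(mask):
    """List the device classes for which DMA stays enabled (hypothetical helper)."""
    return [name for bit, name in sorted(DMA_BITS.items()) if mask & bit]


for mask in (7, 1):
    print(f"libata.dma={mask}: DMA allowed for {dma_enabled_classes(mask)}")
```

Under that assumption, `libata.dma=1` keeps DMA for disks only, so the CD-ROM falls back to PIO, while the default of 7 leaves ATAPI DMA enabled.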
Comment 7 Mathis Ahrens 2008-08-28 04:56:07 UTC
Created attachment 17503
config-2.6.27-rc4 (UP)
Comment 8 Mathis Ahrens 2008-08-28 04:57:00 UTC
Created attachment 17504
dmesg-2.6.27-rc4 (UP)
Comment 9 Mathis Ahrens 2008-08-28 04:57:36 UTC
Created attachment 17505
panic-2.6.27-rc4 (UP)
Comment 10 Tejun Heo 2008-08-29 03:25:54 UTC
Hmm... can you try to swap out memory sticks and see whether that makes any difference?  e.g. If you have two sticks on the machine try the test with only one stick at a time and see whether there's any correlation with the problem?
Comment 11 Mathis Ahrens 2008-08-29 09:46:21 UTC
Oops.
I was wrong when I said 2.6.22 would work.
(tried the wrong config, sorry!)
Truth is, 2.6.22 /may/ work. 
It behaves just like 2.6.27-rc4.

On the bright side, I found the config options that actually make a difference:

+CONFIG_IDE=y
+CONFIG_BLK_DEV_IDE=y
+CONFIG_IDE_TIMINGS=y
+CONFIG_BLK_DEV_IDEDISK=y
+CONFIG_BLK_DEV_IDECD=y
+CONFIG_BLK_DEV_IDECD_VERBOSE_ERRORS=y
+CONFIG_IDE_PROC_FS=y
+CONFIG_BLK_DEV_IDEDMA_SFF=y
+CONFIG_BLK_DEV_IDEPCI=y
+CONFIG_IDEPCI_PCIBUS_ORDER=y
+CONFIG_BLK_DEV_IDEDMA_PCI=y
+CONFIG_BLK_DEV_VIA82CXXX=y
+CONFIG_BLK_DEV_IDEDMA=y
-CONFIG_ATA=y
-CONFIG_SATA_PMP=y
-CONFIG_ATA_SFF=y
-CONFIG_ATA_GENERIC=y
-CONFIG_PATA_VIA=y

It seems clear that libata/PATA_VIA is guilty.
In PIO mode it is too slow, in DMA mode it panics.
Reverting to old IDE drivers fixes both problems.

I went back as far as 2.6.22 just now, and I am beginning
to believe that PATA_VIA never /ever/ worked on my hardware.
Fair enough, since it's marked experimental, but in my
case, it's definitely broken.

If somebody should be interested in fixing PATA_VIA,
I'd be happy to provide further information and test.

(Tejun, in case it still matters: it's a laptop, cannot swap memory)
Comment 12 Mathis Ahrens 2008-08-30 01:42:00 UTC
Dunno if this is to be expected, but I noticed a difference in 
PCI config space, pata vs. ide:

--- lspci-27-rc4	2008-08-30 10:07:20.000000000 +0200
+++ lspci-27-rc4ide	2008-08-30 10:14:52.000000000 +0200

00:11.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06) (prog-if 8a [Master SecP PriP])
	Subsystem: VIA Technologies, Inc. VT82C586/B/VT82C686/A/B/VT8233/A/C/VT8235 PIPC Bus Master IDE
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
	Latency: 32
	Interrupt: pin A routed to IRQ 14
	Region 0: [virtual] Memory at 000001f0 (32-bit, non-prefetchable) [size=8]
	Region 1: [virtual] Memory at 000003f0 (type 3, non-prefetchable) [size=1]
	Region 2: [virtual] Memory at 00000170 (32-bit, non-prefetchable) [size=8]
	Region 3: [virtual] Memory at 00000370 (type 3, non-prefetchable) [size=1]
	Region 4: I/O ports at fc00 [size=16]
	Capabilities: [c0] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-
 00: 06 11 71 05 07 00 90 02 06 8a 01 01 00 20 00 00
 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 20: 01 fc 00 00 00 00 00 00 00 00 00 00 06 11 71 05
 30: 00 00 00 00 c0 00 00 00 00 00 00 00 ff 01 00 00
-40: 0b f2 09 05 18 1c c0 00 99 20 99 20 ff 00 20 20
-50: 17 f6 17 f1 0c 00 00 00 a8 a8 a8 a8 00 00 00 00
+40: 0b f2 09 05 18 1c c0 00 a8 20 a8 20 ff 00 20 20
+50: 17 e6 17 e1 0c 00 00 00 a8 a8 a8 a8 00 00 00 00
 60: 00 02 00 00 00 00 00 00 00 02 00 00 00 00 00 00
 70: 02 01 00 00 00 00 00 00 02 01 00 00 00 00 00 00
-80: 00 20 fa 0e 00 00 00 00 00 30 fa 0e 00 00 00 00
+80: 00 f0 81 1e 00 00 00 00 00 80 8b 1e 00 00 00 00
 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Comment 13 Tejun Heo 2008-08-30 03:03:14 UTC
Cc'ing Alan.  Hello again Alan.  Another data corruption issue.  It's now pata_via and pata_via seems to somehow manage to corrupt kernel code while via82xxx doesn't cause any problem.  Comment #12 (right above this one) shows there's some difference in PCI configuration area between the two drivers.  Any ideas?  Thanks.
Comment 14 Alan 2008-08-30 04:19:36 UTC
48 -> 99 20 99 20 - active pulse width/recovery for DIOR/DIOW on data xfers [PCI clocks]
A8A8A8A8 is the chip boot up default

So old IDE programmed the drive timing control and we didn't. It is only used for PIO, however, and A8 is more conservative than 99 and way more than 20.

50 -> 17 f6 17 f1 v 17 e6 17 e1 - UDMA timing : old IDE has stomped on the cable reporting bit

The 80 difference is the debug register dump of the S/G descriptor address

Nothing I can see relevant there although you can certainly safely poke those bits back with setpci before testing (except the 0x80 stuff which is just debug)
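
To make it easier to see exactly which registers differ, here is a rough sketch (not from the original comment) that compares the changed rows of the two dumps in comment #12 byte by byte. The rows are copied from that diff and labelled by the file names in its header; the helper names are invented for this example.

```python
def parse_row(row):
    """'40: 0b f2 ...' -> (0x40, [0x0b, 0xf2, ...])"""
    offset, data = row.split(":")
    return int(offset, 16), [int(b, 16) for b in data.split()]


def changed_bytes(old_rows, new_rows):
    """Return (register offset, old byte, new byte) for every differing byte."""
    changes = []
    for old_row, new_row in zip(old_rows, new_rows):
        base, old = parse_row(old_row)
        new_base, new = parse_row(new_row)
        assert base == new_base, "rows must describe the same offsets"
        for i, (a, b) in enumerate(zip(old, new)):
            if a != b:
                changes.append((base + i, a, b))
    return changes


# "-" rows of the diff (file lspci-27-rc4)
RC4 = ["40: 0b f2 09 05 18 1c c0 00 99 20 99 20 ff 00 20 20",
       "50: 17 f6 17 f1 0c 00 00 00 a8 a8 a8 a8 00 00 00 00",
       "80: 00 20 fa 0e 00 00 00 00 00 30 fa 0e 00 00 00 00"]
# "+" rows of the diff (file lspci-27-rc4ide)
RC4_IDE = ["40: 0b f2 09 05 18 1c c0 00 a8 20 a8 20 ff 00 20 20",
           "50: 17 e6 17 e1 0c 00 00 00 a8 a8 a8 a8 00 00 00 00",
           "80: 00 f0 81 1e 00 00 00 00 00 80 8b 1e 00 00 00 00"]

for offset, old, new in changed_bytes(RC4, RC4_IDE):
    print(f"0x{offset:02x}: {old:02x} -> {new:02x} (xor {old ^ new:02x})")
```

The differences fall at 0x48 and 0x4a (timing), at 0x51 and 0x53 (a single bit, 0x10, in each byte), and in the 0x80 block, which is the debug area mentioned above.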
Comment 15 Alan 2008-08-30 04:24:26 UTC
"Most likely, the actual code fetched from I-cache was different and was
corrupted, and code sequence we print in the above trace is coming from
L2/D-cache and probably different from what was tried to be executed."

That comment of Venkatesh's implies a hardware bug: x86 DMA is coherent, so the ATA layer cannot break this in the way suggested. I wonder if it's got busted SMM handlers.

What happens if you boot with acpi=off? Another test on current kernels would be to boot with io_delay=udelay, in case this is like one or two other annoying HP laptops. Not sure why libata should happen to trigger either.

Exactly which model is this laptop?
Comment 16 Alan 2008-08-30 04:25:23 UTC
Oh and I assume you've tried an overnight run of memtest86 ?

Alan
Comment 17 Tejun Heo 2008-08-30 04:33:09 UTC
(In reply to comment #15)
> "Most likely, the actual code fetched from I-cache was different and was
> corrupted, and code sequence we print in the above trace is coming from
> L2/D-cache and probably different from what was tried to be executed."

I-cache isn't coherent, so if the particular code region is being overwritten while the cpu is at it, there is really no way to exactly tell what the CPU is executing.

> Oh and I assume you've tried an overnight run of memtest86 ?

And yeah, please do that to rule out memory problem.
Comment 18 Alan 2008-08-30 04:50:24 UTC
Impossible no - statistically impossible yes ...
Comment 19 Mathis Ahrens 2008-08-31 03:30:15 UTC
Thanks for looking at this.
Some answers to your questions:

This is a Fujitsu-Siemens Amilo A 7620.

ACPI is not compiled in any more, see comment #6

I ran memtest for 2 hours (2.5 passes), no errors.

Tried "io_delay=delay", no effect.

Switched to rc5, no difference 

Played with the config: disabled PortMultiplier and tried CONFIG_IO_DELAY_0X80, no effect.



Anything else I could try?
Comment 20 Mathis Ahrens 2008-08-31 03:37:51 UTC
Oh, and I flipped the pci config space bits back, but also no effect as expected.
Comment 21 Mathis Ahrens 2008-09-01 14:43:28 UTC
Hrmm.
In order to make things absolutely comparable, I switched to using a single kernel, with different initramfs.
Everything is compiled in, except via82cxxx / pata_via, and only one of those is  loaded from initramfs.
The results remain:


                             |        via82cxxx          | 
                             |            or             |    pata_via 
                             | pata_via AND libata.dma=1 |  
-----------------------------+---------------------------+----------------+
cp /hdisk/file /hdisk/file2  |                           |                |
cp /usbstorage/file /hdisk   |       NEVER PANICS        |  NEVER PANICS  |
anything else without cdrom  |                           |                |
-----------------------------+---------------------------+----------------+
cp /cdrom/file  /hdisk       |                           |                |
cp /cdrom/file  /dev/null    |       NEVER PANICS        | ALWAYS PANICS  |
cat /cdrom/file              |                           |                |
-----------------------------+---------------------------+----------------+

My not very educated conclusion is, the libata stack's dma is somehow involved in the memory corruption that Venkatesh diagnosed.
Are there any other realistic possibilities left?

If not, how could I get more useful information about the cause of this corruption?
Any debug switches?
Any other method to narrow it down?

Thanks!
Comment 22 Alan 2008-09-01 15:22:04 UTC
I'm really at a loss. The symptoms you describe simply don't fit any rational explanation. There are known problems with those boxes (often needing noapic nolapic acpi=off) but nothing that would explain this and the controller code is rock solid elsewhere yet this code is totally shared. Also the two IDE stacks produce almost exactly the same DMA command sequences for the device.

You could try switching the device to the dumb sg builder

add

        .qc_prep        = ata_sff_dumb_qc_prep,

in via_port_ops

and
       .sg_tablesize = LIBATA_DUMB_MAX_PRD

but I doubt that'll have any effect

Alan 
Comment 23 Mathis Ahrens 2008-09-05 14:51:05 UTC
Correct,  those don't help either:
       .qc_prep        = ata_sff_dumb_qc_prep,
       .sg_tablesize = LIBATA_DUMB_MAX_PRD
Comment 24 Thierry Lathuille 2008-09-18 07:55:39 UTC
Hello,
I've got the same hardware in my desktop computer and I'm experiencing problems as well. 

After copying a few MB :

- with "noapic nolapic", I get the same hard lockups

- otherwise, only the keyboard, mouse and network stop working, while everything else goes on (copying, all apps...). Starting  itop before this kbd-mouse freeze shows no more interrupts from the i8042. 

If it can help anyone to see clearer in this...

BTW, everything works well with via82cxxx

# lspci
00:00.0 Host bridge: VIA Technologies, Inc. VT8377 [KT400/KT600 AGP] Host Bridge
00:01.0 PCI bridge: VIA Technologies, Inc. VT8235 PCI Bridge
00:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)
00:0b.0 Multimedia audio controller: Creative Labs SB Audigy LS
00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80)
00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80)
00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80)
00:10.3 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 82)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8235 ISA Bridge
00:11.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00:12.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 74)
01:00.0 VGA compatible controller: nVidia Corporation NV18 [GeForce4 MX 440 AGP 8x] (rev a2)
Comment 25 Mathis Ahrens 2008-09-19 05:30:54 UTC
Thanks for reporting Thierry, I feel a bit less alone now! ;)

Could you elaborate a bit on when exactly you are hitting
lockups, i.e.
- Does pata_via run ok as long as the cdrom is not involved?
- Have you tried to disable dma?
(see comment #21 for a table of my experiences)

If you could somehow capture the actual panic, that would
be most valuable!
Comment 26 da_norf 2008-11-28 00:45:49 UTC
Hello,
Same problem here on a Sony Vaio laptop PCG-FR102 (IDE chipset : Apollo KT266/A/333) running under Fedora.

Since Fedora 7 (and continuing with Fedora 8 and 9), or, if you prefer, since the kernel of this distro started using the pata_via driver:
 - When transferring large amounts of data from CD-ROM (installing from CD or DVD, ripping music from CD, reading a DVD): the laptop freezes.
 - When doing multiple things (opening windows, menus, launching more apps) while data is already being read from or written to the HD: the laptop freezes.
 - Some (very rare) times: the laptop freezes during boot.

Same behaviour experienced with different HDs and CD/DVD readers.
Memtest OK.
As said before, this laptop was running well (and still does) with Fedora FC6 and earlier versions (at least from the IDE controller point of view).
Comment 27 hoplite 2009-02-27 02:34:09 UTC
Good day
I have a Fujitsu-Siemens Amilo A 7620 laptop and I suffer from exactly the same problems as Mathis, so I guess this is probably not caused by broken hardware such as bad memory. I have run Memtest with no errors.

I'm running Ubuntu 8.10 with 2.6.27-11 kernel. I don't know much about Linux but I'd be glad to help if there is something I can do. 
Comment 28 Avery Fay 2010-07-15 05:38:56 UTC
Anything that I could do to help this bug get fixed?

I have an Averatec 3200 series laptop that I can't even install Linux on anymore because all the distributions are using pata_via instead of the old via82cxxx driver (which works). Also, this seems to be hitting everyone with my laptop, judging by the (unresolved) threads on the Ubuntu forums.

The symptoms are exactly like above. Any CD access quickly causes an oops. HD access seems to be okay. I haven't tried disabling dma yet.

Also, the backtrace varies a bit. Sometimes I get the acpi_idle_enter_simple one. Sometimes I get other ones. However, my memory is fine and the old via driver never crashes.
Comment 29 Tejun Heo 2010-07-15 09:53:56 UTC
Avery, what was the latest distro you tried?
Comment 30 Avery Fay 2010-07-15 18:58:23 UTC
The latest I've tried is Ubuntu 10.04.

I've also used the Debian squeeze beta installer, but interestingly enough the installer kernel in Debian is still compiled with the old via IDE driver so everything works. Of course, as soon as you apt-get upgrade, you get a newer kernel with pata_via.

I also verified yesterday that disabling ATAPI DMA via libata.dma=1 prevents the crash.
Comment 31 Mathis Ahrens 2010-08-15 00:43:57 UTC
I thought I'd mention that I do not use that particular laptop anymore, partly due to this issue.
But I noticed just now that in comment #19 I wrote "io_delay=delay" instead of "io_delay=udelay".
Maybe I mistyped on the kernel command line, too.
Avery, you might give that a try if you are still motivated...
Comment 32 Jim Bray 2011-07-24 23:41:52 UTC
I have an Averatec 3220, and I have to use the Deprecated ATA drivers because using a cd usually freezes the machine fairly quickly. My guess would be that cd error/retry code is involved, since the crash is not immediate and 100% reproducible, but that is strictly a guess.
 
This bug is still present in 2.6.39.3.
Comment 33 Jim Bray 2011-07-24 23:44:49 UTC
Further note: the freeze occurs with recent Ubuntu and Debian kernels, in addition to my hand-built kernels. Only workaround I've found is to build a kernel with ATA and no SATA/PATA support.
Comment 34 Tejun Heo 2011-07-27 12:37:38 UTC
Created attachment 66832
large-dma-pad.patch

Urgh... I hate that controller. Can you please see whether the attached patch makes any difference?

Thanks.
Comment 35 Jim Bray 2011-07-27 20:13:01 UTC
This bug is still alive in 3.0.0.
My test is trying to rip a CD with grip.
Results:
Using deprecated ATA/IDE driver, works.
Using via-pata stock, or PAD_SZ=32, immediate total freeze.
PAD_SZ=128, ran for part of first track, then froze.
Going to try with pad size of 1024.
Comment 36 Jim Bray 2011-07-27 21:13:36 UTC
With pad size of 1024, immediate freeze.
Tried eliminating via-pata and using generics, but couldn't boot.
I'm using a CD which is readable but produces lots of soft errors.
Comment 37 Jim Bray 2011-07-27 21:23:31 UTC
BTW, pad size refers to libata.h:ATA_DMA_PAD_SZ, and thanks for the patch.
Comment 38 Tejun Heo 2011-07-28 07:44:58 UTC
Dang, I'm out of ideas. Maybe we should just disable DMA on ATAPI devices on pata_via?  It's a sucky solution but at least better than freezing the whole machine. Sorry. :(
Comment 39 Alan 2011-07-28 09:18:10 UTC
That would be a regression for the 99.99% of people who seem to have working VIA PATA ATAPI so wouldn't really be acceptable, especially given the interrupt driven PIO handling in libata does too much with IRQs off.

It seems to affect two specific laptops, so presumably it is some kind of firmware/hardware problem specific to them. Certainly some of the reports look very hardware/firmware-ish. We could blacklist the two specific laptops (the Amilo A 7620 and the Averatec 3200/3220).

I wonder if these are actually two brandings of the same laptop, in fact?

Alan
Comment 40 Tejun Heo 2011-07-28 09:26:55 UTC
Yeap, finer grained blacklisting would be much better. Jim, can you please post the output of dmidecode and "lspci -nnvvv"?
Comment 41 Jim Bray 2011-07-28 15:03:42 UTC
Created attachment 67002
dmidecode

dmidecode
Comment 42 Jim Bray 2011-07-28 15:04:55 UTC
Created attachment 67012
lspci -nnvvv
Comment 43 Jim Bray 2011-07-28 15:07:48 UTC
Created attachment 67022
dmesg pata kernel with ata_dma_pad_sz=1024

Below is some interesting chatter from dmesg (attached) which I have not seen before. This is with no cd in the drive.

cdrom: Uniform CD-ROM driver Revision: 3.20
sr 1:0:0:0: Attached scsi CD-ROM sr0
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata2.00: ST-ATA: DRQ=0 without device error, dev_stat 0x50
ata2.00: failed command: IDENTIFY PACKET DEVICE
ata2.00: cmd a1/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 1024 in
         res 50/00:02:00:00:02/00:00:00:00:00/00 Emask 0x202 (HSM violation)
ata2.00: status: { DRDY }
ata2: soft resetting link
ata2.00: configured for UDMA/33
ata2: EH complete
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata2.00: ST-ATA: DRQ=0 without device error, dev_stat 0x50
ata2.00: failed command: IDENTIFY PACKET DEVICE
ata2.00: cmd a1/00:01:00:00:00/00:00:00:00:00/00 tag 0 pio 1024 in
         res 50/00:02:00:00:02/00:00:00:00:00/00 Emask 0x202 (HSM violation)
ata2.00: status: { DRDY }
ata2: soft resetting link
ata2.00: configured for UDMA/33
ata2: EH complete
Comment 44 Jim Bray 2011-07-28 15:13:58 UTC
  Correction: I was in fact able to boot and test with ATA_DMA_PAD_SZ=1024. It failed at about the same place on the cd as with _SZ=4.

The previously reported boot failure(s) were due to Debian update-initrd not including the scsi disk module in the initrd. Don't know why it was working before. I'll retry the generic pata options again.
Comment 45 Jim Bray 2011-07-28 15:37:55 UTC
I can't test with generic pata. pata_acpi gives a message like the one I get below with the currently running kernel,
VIA_IDE 0000:00:11.1: can't derive routing for PCI INT A
and says no more.
Comment 46 Tejun Heo 2011-07-30 12:33:24 UTC
Created attachment 67172
via-no-atapi-dma-on-averatec3200.patch

Can you please apply the attached patch and post the kernel boot log?  It should disable DMA on your machine.

Thanks.
Comment 47 Jim Bray 2011-07-30 17:53:39 UTC
Created attachment 67182
screen-shot at freeze

I applied the patch, and reverted the ata_dma_pad_sz value to 4.
I booted twice, and the system froze, apparently immediately upon configuring
the device for MWDMA2. On the first freeze, I immediately did alt-sysrq-? and got the expected output, then no other alt-sysrq commands worked. On the second freeze, no alt-sysrq commands worked.
I took a picture of the screen at freeze and reduced it with gimp down to a reasonable size and attached it.
Comment 48 Tejun Heo 2011-07-30 18:00:29 UTC
Created attachment 67192 [details]
via-no-atapi-dma-on-averatec3200-1.patch

Sorry, wrong mask. Can you please try this one?
Comment 49 Jim Bray 2011-07-30 18:30:49 UTC
OK, applied by hand. This is the code I will try:

        if (dev->class == ATA_DEV_ATAPI &&
            dmi_check_system(no_atapi_dma_dmi_table)) {
                // UNDEFINED ata_dev_warn(dev, "controller locks up on ATAPI DMA, forcing PIO\n");
                ata_dev_printk(dev, KERN_WARNING, "controller locks up on ATAPI DMA, forcing PIO\n");
                mask &= ATA_MASK_PIO;;
        }
Comment 50 Jim Bray 2011-07-30 18:31:24 UTC
(minus the extra semicolon)
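
For reference, the effect of the hand-applied check in comment 49 can be sketched in user space. This is only an illustration, not the pata_via code: the `XFER_*` names, the `filter_modes` function and its two boolean parameters are hypothetical stand-ins. The bit layout (7 PIO bits, then 5 MWDMA bits, then 8 UDMA bits) is assumed to mirror libata's transfer-mode mask.

```c
#include <assert.h>

/* Assumed layout, modelled on libata's transfer-mode mask:
 * PIO modes in the low bits, then MWDMA, then UDMA. */
enum {
    XFER_NR_PIO      = 7,
    XFER_NR_MWDMA    = 5,
    XFER_NR_UDMA     = 8,
    XFER_SHIFT_PIO   = 0,
    XFER_SHIFT_MWDMA = XFER_SHIFT_PIO + XFER_NR_PIO,
    XFER_SHIFT_UDMA  = XFER_SHIFT_MWDMA + XFER_NR_MWDMA
};

#define XFER_MASK_PIO   (((1UL << XFER_NR_PIO) - 1) << XFER_SHIFT_PIO)
#define XFER_MASK_MWDMA (((1UL << XFER_NR_MWDMA) - 1) << XFER_SHIFT_MWDMA)
#define XFER_MASK_UDMA  (((1UL << XFER_NR_UDMA) - 1) << XFER_SHIFT_UDMA)
#define XFER_MASK_ALL   (XFER_MASK_PIO | XFER_MASK_MWDMA | XFER_MASK_UDMA)

/* Hypothetical stand-in for a driver's mode filter.
 * is_atapi       plays the role of dev->class == ATA_DEV_ATAPI,
 * on_blacklist   plays the role of dmi_check_system(...) matching. */
unsigned long filter_modes(unsigned long mask, int is_atapi, int on_blacklist)
{
    /* Only bits inside the assumed layout are meaningful here. */
    assert((mask & ~XFER_MASK_ALL) == 0);

    if (is_atapi && on_blacklist) {
        /* Keep the PIO bits; every MWDMA and UDMA bit is cleared,
         * so the device can only be configured for PIO. */
        mask &= XFER_MASK_PIO;
    }
    return mask;
}
```

With every mode offered, an ATAPI device on a blacklisted machine is left with only the PIO bits, while a non-ATAPI device or a machine that is not blacklisted keeps its original mask.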
Comment 51 Jim Bray 2011-07-30 19:57:51 UTC
 Ran a test in single-user mode with the pata_via PIO patch, using
cdparanoia -v 1 /dev/null
multiple times, worked fine. 

Decided to compare to unpatched kernel, which also worked fine with cdparanoia in single-user.

Unpatched kernel always freezes at the same place on the test CD, using grip (which uses cdparanoia) to read the CD, while running X, multi-user, without jiggling the cursor with the touchpad.

Running in an xterm, cdparanoia worked, with multiple passes, *until I moved the cursor around with the Synaptics touchpad*.

  Kernel with pata_via PIO patch works fine with all tests. This kernel is built with a lot of debugging configured, which may affect the results.

This reminds me of a problem I ran into in about 2000 with this laptop, which I can't find in kernel bugzilla, but did report, maybe on LKML. The gist of this problem was some sort of conflict between X (meaning the graphics device) and some other device, I think it was Wifi. I *think* this involved an irq/dma problem. I *think* the workaround at the time was disabling ACPI. The current problem may be similar. Maybe Openchrome AGP clashing with the ATAPI device?

BTW, the Amilo A7620 is a different beast than the Averatec 3200 series.
http://www.dooyoo.co.uk/laptops/fujitsu-siemens-amilo-a-7620/details/
It does have a VIA motherboard and probably a VIA chipset.
Comment 52 Jim Bray 2011-07-30 21:39:59 UTC
I made an unpatched kernel so stripped down that I discovered that a kernel without networking enabled won't boot (udevd needs sockets). With that kernel, and using
dd if=/dev/scd0 of=/dev/null
and a recent Debian install disc, I am somewhat persuaded that this bug involves a bad interaction of the controller the ATAPI drive is on and the graphics hardware.

The dd works fine in text-mode, single or multiuser.

The dd hangs the system when run in an xterm, even with no mouse motion or other input.

  The system, when hung, still responds to the Wifi button (turns Wifi light on and off) and the power button (apparently tries to suspend system, power light blinks, console goes to white-blank).

  This is with no ACPI, no PM, non-preemptible kernel, no wifi.
Comment 53 Tejun Heo 2011-08-04 09:18:05 UTC
Patch posted upstream. Yeah, ISTR another data corruption issue w/ pata_via. At this point, I think blacklisting is the best solution.

  http://article.gmane.org/gmane.linux.ide/49983

Thanks.
Comment 54 Florian Mickler 2011-08-23 20:20:01 UTC
A patch referencing this bug report has been merged in Linux v3.1-rc3:

commit 6d0e194d2eefcaab6dbdca1f639748660144acb5
Author: Tejun Heo <tj@kernel.org>
Date:   Thu Aug 4 11:15:07 2011 +0200

    pata_via: disable ATAPI DMA on AVERATEC 3200
Comment 55 Ric 2011-08-24 15:13:53 UTC
Tejun Heo:

Thank you for your patch. Could you please add:

dmi.product.name: PCG-FR285E(FR)
dmi.sys.vendor: Sony Corporation

to the blacklist ? That would definitely close bug 486063 on launchpad.
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/486063
Comment 56 Tejun Heo 2011-08-26 13:44:02 UTC
Ric, have you verified that turning off DMA for the cdrom removes the reported hang? If so, I'll add it right away.

Thanks.
Comment 57 Ric 2011-08-26 15:19:15 UTC
Hi Tejun Heo,

I have not, and unfortunately I won't have access to this laptop before the end of the year, when I visit my mother-in-law :)

For now, it uses a Lucid kernel compiled with the deprecated IDE module.

I'll post here as soon as I can test it with a stock kernel and DMA turned off.
