Bug 7486 - sata_promise/hdparm freeze on setting write_cache off
Summary: sata_promise/hdparm freeze on setting write_cache off
Status: REJECTED DUPLICATE of bug 7412
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: i386 Linux
Importance: P2 normal
Assignee: Jeff Garzik
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-11-10 06:32 UTC by martin f. krafft
Modified: 2007-01-11 02:05 UTC

See Also:
Kernel Version: 2.6.18.2
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
lspci -vv output (2.94 KB, application/octet-stream)
2006-11-10 06:36 UTC, martin f. krafft

Description martin f. krafft 2006-11-10 06:32:42 UTC
Most recent kernel where this bug did not occur: 2.6.17.x
Distribution: Debian sid
Hardware Environment: Athlon XP 64 3500+, VIA K8T800Pro chipset, SWRAID10 on 4
SATA disks, two on VIA VT6420, two on PDC20378, Adaptec AIC-7892A with a tape
drive attached. lspci.bz2 attached.
Software Environment: not sure what you want here
Problem Description:

If I try to use hdparm to disable the write cache of the two devices connected
to the Promise/FastTrak controller (which is being used not as a HW-RAID
controller, but rather as provider of two separate SATA channels), the system
freezes hard.

lspci -v (full lspci -vv attached):
  [...]
  00:08.0 RAID bus controller: Promise Technology, Inc. PDC20378 (FastTrak 378/SATA 378) (rev 02)
    Subsystem: ASUSTeK Computer Inc. K8V Deluxe/PC-DL Deluxe motherboard
    Flags: bus master, 66MHz, medium devsel, latency 96, IRQ 177
    I/O ports at 8800 [size=64]
    I/O ports at 8400 [size=16]
    I/O ports at 8000 [size=128]
    Memory at fb300000 (32-bit, non-prefetchable) [size=4K]
    Memory at fb200000 (32-bit, non-prefetchable) [size=128K]
    Capabilities: [60] Power Management version 2

If I comment out write_cache=off for sd[gh] (those are the two Promise
drives), the system boots fine. I can also set -W0 on sd[ef], which are
connected to a sata_via controller (see lspci attachment for details).

If I run hdparm -W0 on sdg or sdh, I get a panic:

  Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP: 
   [<ffffffff8818c642>] :sata_promise:pdc_eng_timeout+0x62/0x18d
  PGD 35fb2067 PUD 3585d067 PMD 0 
  Oops: 0000 [1] SMP 
  CPU 0 
  Modules linked in: rfcomm l2cap button ac battery ipv6 ipt_MASQUERADE
iptable_nat ipt_REJECT ipt_addrtype ipt_LOG xt_limit xt_tcpudp xt_conntrack
ip_nat_ftp ip_nat ip_conntrack_ftp ip_conntrack nfnetlink iptable_filter
ip_tables x_tables netconsole snd_seq_dummy snd_seq_oss snd_seq_midi
snd_seq_midi_event snd_seq snd_via82xx tsdev serio_raw snd_bt87x
snd_via82xx_modem snd_ac97_codec snd_pcm_oss snd_mixer_oss evdev snd_mpu401_uart
snd_pcm psmouse snd_rawmidi snd_seq_device snd_timer snd soundcore eth1394
pcspkr floppy ext3 jbd mbcache dm_mirror dm_snapshot dm_mod raid10 raid1 md_mod
ide_generic ide_cd cdrom skge sd_mod hci_usb bluetooth usbhid usb_storage bt878
via82cxxx ohci1394 shpchp pci_hotplug ieee1394 sata_promise sk98lin sata_via
aic7xxx scsi_transport_spi bttv video_buf firmware_class ir_common
compat_ioctl32 i2c_algo_bit btcx_risc tveeprom videodev v4l1_compat v4l2_common
libata scsi_mod generic ide_core uhci_hcd ehci_hcd i2c_viapro i2c_core gameport
snd_ac97_bus snd_page_alloc thermal processor fan
  Pid: 1129, comm: scsi_eh_4 Not tainted 2.6.18-2-amd64 #1
  RIP: 0010:[<ffffffff8818c642>]  [<ffffffff8818c642>]
:sata_promise:pdc_eng_timeout+0x62/0x18d
  RSP: 0018:ffff81003d86fe40  EFLAGS: 00010096
  RAX: 00000000fafbfcfd RBX: ffff81003e080000 RCX: 000000000000acd4
  RDX: 00000000ffffff01 RSI: 0000000000000046 RDI: ffff81003e3461c0
  RBP: ffff81003e0804e8 R08: ffffffff804dc140 R09: 0000000000000012
  R10: ffff81003d86fe08 R11: 0000000000000000 R12: 0000000000000000
  R13: ffff81003e3461c0 R14: 0000000000000246 R15: 0000000000000005
  FS:  00002ad83985c8c0(0000) GS:ffffffff80520000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
  CR2: 0000000000000028 CR3: 0000000035c4f000 CR4: 00000000000006e0
  Process scsi_eh_4 (pid: 1129, threadinfo ffff81003d86e000, task ffff810037f55770)
  Stack:  ffff81003e080000 ffff81003e0804e8 ffff81003df05ac8 ffff81003e080000
   ffff81003e080000 ffffffff880a4f11 0000000000000282 ffff81003e080000
   ffffffff880767ed ffff81003df05ac8 ffff81003e080000 ffff81003df05ab8
  Call Trace:
   [<ffffffff880a4f11>] :libata:ata_scsi_error+0x418/0x50b
   [<ffffffff880767ed>] :scsi_mod:scsi_error_handler+0x0/0xa81
   [<ffffffff80290195>] keventd_create_kthread+0x0/0x61
   [<ffffffff880768ac>] :scsi_mod:scsi_error_handler+0xbf/0xa81
   [<ffffffff80290195>] keventd_create_kthread+0x0/0x61
   [<ffffffff880767ed>] :scsi_mod:scsi_error_handler+0x0/0xa81
   [<ffffffff80290195>] keventd_create_kthread+0x0/0x61
   [<ffffffff8023055a>] kthread+0xd4/0x107
   [<ffffffff80259318>] child_rip+0xa/0x12
   [<ffffffff80290195>] keventd_create_kthread+0x0/0x61
   [<ffffffff80230486>] kthread+0x0/0x107
   [<ffffffff8025930e>] child_rip+0x0/0x12
  
  
  Code: 41 8a 44 24 28 3c 01 74 0d 3c 03 bb e8 03 00 00 0f 85 93 00 
  RIP  [<ffffffff8818c642>] :sata_promise:pdc_eng_timeout+0x62/0x18d
   RSP <ffff81003d86fe40>
  CR2: 0000000000000028
   NMI Watchdog detected LOCKUP on CPU 0
  CPU 0 
  Modules linked in: rfcomm l2cap button ac battery ipv6 ipt_MASQUERADE
iptable_nat ipt_REJECT ipt_addrtype ipt_LOG xt_limit xt_tcpudp xt_conntrack
ip_nat_ftp ip_nat ip_conntrack_ftp ip_conntrack nfnetlink iptable_filter
ip_tables x_tables netconsole snd_seq_dummy snd_seq_oss snd_seq_midi
snd_seq_midi_event snd_seq snd_via82xx tsdev serio_raw snd_bt87x
snd_via82xx_modem snd_ac97_codec snd_pcm_oss snd_mixer_oss evdev snd_mpu401_uart
snd_pcm psmouse snd_rawmidi snd_seq_device snd_timer snd soundcore eth1394
pcspkr floppy ext3 jbd mbcache dm_mirror dm_snapshot dm_mod raid10 raid1 md_mod
ide_generic ide_cd cdrom skge sd_mod hci_usb bluetooth usbhid usb_storage bt878
via82cxxx ohci1394 shpchp pci_hotplug ieee1394 sata_promise sk98lin sata_via
aic7xxx scsi_transport_spi bttv video_buf firmware_class ir_common
compat_ioctl32 i2c_algo_bit btcx_risc tveeprom videodev v4l1_compat v4l2_common
libata scsi_mod generic ide_core uhci_hcd ehci_hcd i2c_viapro i2c_core gameport
snd_ac97_bus snd_page_alloc thermal processor fan
  Pid: 2817, comm: md5_raid10 Not tainted 2.6.18-2-amd64 #1
  RIP: 0010:[<ffffffff8025e8c6>]  [<ffffffff8025e8c6>] .text.lock.spinlock+0x2/0x8a
  RSP: 0018:ffffffff804bfde0  EFLAGS: 00000086
  RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
  RDX: ffffffff804bfe98 RSI: ffff81003e3461c0 RDI: ffff81003e3461c0
  RBP: ffffc20000036000 R08: ffff81003eedc000 R09: 0000000000000246
  R10: 0000000000000000 R11: ffff810037ada770 R12: 0000000000000000
  R13: 00000000000000b1 R14: ffff81003e3461c0 R15: ffffffff804bfe98
  FS:  00002ad83985c8c0(0000) GS:ffffffff80520000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
  CR2: 0000000000000028 CR3: 0000000035c4f000 CR4: 00000000000006e0
  Process md5_raid10 (pid: 2817, threadinfo ffff81003eedc000, task ffff8100011fd870)
  Stack:  ffffffff8818c172 ffffffff80257af1 ffff81003dff2340 0000000000000000
   0000000000000000 00000000000000b1 ffffffff804bfe98 ffffffff804bfe98
   ffffffff8020f0f4 ffffffff80528d00 0000000000005880 00000000000000b1
  Call Trace:
   <IRQ> [<ffffffff8818c172>] :sata_promise:pdc_interrupt+0x3b/0x1d9
   [<ffffffff80257af1>] blk_run_queue+0x28/0x72
   [<ffffffff8020f0f4>] handle_IRQ_event+0x29/0x58
   [<ffffffff802a4302>] __do_IRQ+0xa4/0x105
   [<ffffffff88077dd7>] :scsi_mod:scsi_io_completion+0x156/0x334
   [<ffffffff80263fdf>] do_IRQ+0x65/0x73
   [<ffffffff80258989>] ret_from_intr+0x0/0xa
   [<ffffffff80210376>] __do_softirq+0x53/0xd5
   [<ffffffff8026e567>] end_level_ioapic_vector+0x9/0x16
   [<ffffffff80259664>] call_softirq+0x1c/0x28
   [<ffffffff80264019>] do_softirq+0x2c/0x7d
   [<ffffffff80263fe4>] do_IRQ+0x6a/0x73
   [<ffffffff80258989>] ret_from_intr+0x0/0xa
   <EOI> [<ffffffff8020b1c4>] memcmp+0xb/0x22
   [<ffffffff882a4522>] :raid10:raid10d+0x233/0x9da
   [<ffffffff80290195>] keventd_create_kthread+0x0/0x61
   [<ffffffff8025d504>] schedule_timeout+0x1e/0xad
   [<ffffffff80290195>] keventd_create_kthread+0x0/0x61
   [<ffffffff8828ac2a>] :md_mod:md_thread+0xf8/0x10e
   [<ffffffff80290358>] autoremove_wake_function+0x0/0x2e
   [<ffffffff8828ab32>] :md_mod:md_thread+0x0/0x10e
   [<ffffffff8023055a>] kthread+0xd4/0x107
   [<ffffffff80259318>] child_rip+0xa/0x12
   [<ffffffff80290195>] keventd_create_kthread+0x0/0x61
   [<ffffffff80230486>] kthread+0x0/0x107
   [<ffffffff8025930e>] child_rip+0x0/0x12
  
  
  Code: 83 3f 00 7e f9 e9 6d fe ff ff e8 ff d7 ff ff e9 7d fe ff ff 
  console shuts up ...
   <0>Kernel panic - not syncing: Aiee, killing interrupt handler!
   <0>Rebooting in 60 seconds..

Curiously, the last two lines do not always appear; sometimes the
system simply stays frozen and never reboots.
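For triage, the faulting location in the first oops above can be pulled out mechanically. The following is a minimal Python sketch, assuming the 2.6.18-era oops layout shown in this report; `parse_oops` and the regular expressions are hypothetical helpers written for illustration, not part of any kernel tooling.

```python
import re

# Matches the summary line printed at the end of an oops, for example:
#   RIP  [<ffffffff8818c642>] :sata_promise:pdc_eng_timeout+0x62/0x18d
# The ":module:" part is optional because core kernel symbols have none.
RIP_RE = re.compile(
    r"RIP\s+\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?::(?P<module>\w+):)?"
    r"(?P<func>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)

# CR2 holds the address whose access caused the page fault.
CR2_RE = re.compile(r"CR2:\s+(?P<cr2>[0-9a-f]{16})")


def parse_oops(text):
    """Return the faulting module, function, offset, and fault address.

    Hypothetical helper: returns None if no RIP summary line is found.
    """
    rip = RIP_RE.search(text)
    if rip is None:
        return None
    cr2 = CR2_RE.search(text)
    return {
        "module": rip.group("module"),
        "function": rip.group("func"),
        "offset": int(rip.group("off"), 16),
        "size": int(rip.group("size"), 16),
        "fault_address": int(cr2.group("cr2"), 16) if cr2 else None,
    }


# Lines copied from the first oops in this report.
sample = """\
RIP  [<ffffffff8818c642>] :sata_promise:pdc_eng_timeout+0x62/0x18d
 RSP <ffff81003d86fe40>
CR2: 0000000000000028
"""

info = parse_oops(sample)
print(info["module"], info["function"], hex(info["offset"]), hex(info["fault_address"]))
```

Applied to the trace above, this points at `pdc_eng_timeout` in `sata_promise`. The small fault address in CR2 matches the "NULL pointer dereference" wording of the oops itself.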

sdg is a Maxtor 250 GB SATA drive at UDMA6
sdh is a Samsung 250 GB SATA drive at UDMA7

One difference between the two is that the RAID10 array holding the swap
partition only spans sd[efg] and does not touch sdh.

Both drives are healthy according to smartctl. This is what dmesg
knows about them:

  sata_promise 0000:00:08.0: version 1.04
  ACPI: PCI Interrupt 0000:00:08.0[A] -> GSI 18 (level, low) -> IRQ 177
  ata3: SATA max UDMA/133 cmd 0xFFFFC20000036200 ctl 0xFFFFC20000036238 bmdma
0x0 irq 177
  ata4: SATA max UDMA/133 cmd 0xFFFFC20000036280 ctl 0xFFFFC200000362B8 bmdma
0x0 irq 177
  scsi4 : sata_promise
  ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
  ata3.00: ATA-7, max UDMA/133, 490234752 sectors: LBA48 
  ata3.00: ata3: dev 0 multi count 0
  ata3.00: configured for UDMA/133
  scsi5 : sata_promise
  ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
  ata4.00: ATA-7, max UDMA7, 488397168 sectors: LBA48 NCQ (depth 0/32)
  ata4.00: configured for UDMA/133
    Vendor: ATA       Model: Maxtor 7Y250M0    Rev: YAR5
    Type:   Direct-Access                      ANSI SCSI revision: 05
  SCSI device sdg: 490234752 512-byte hdwr sectors (251000 MB)
  sdg: Write Protect is off
  sdg: Mode Sense: 00 3a 00 00
  SCSI device sdg: drive cache: write back
  SCSI device sdg: 490234752 512-byte hdwr sectors (251000 MB)
  sdg: Write Protect is off
  sdg: Mode Sense: 00 3a 00 00
  SCSI device sdg: drive cache: write back
   sdg: sdg1 sdg2 sdg3 < sdg5 sdg6 sdg7 sdg8 sdg9 sdg10 >
  sd 4:0:0:0: Attached scsi disk sdg
    Vendor: ATA       Model: SAMSUNG SP2504C   Rev: VT10
    Type:   Direct-Access                      ANSI SCSI revision: 05
  SCSI device sdh: 488397168 512-byte hdwr sectors (250059 MB)
  sdh: Write Protect is off
  sdh: Mode Sense: 00 3a 00 00
  SCSI device sdh: drive cache: write through
  SCSI device sdh: 488397168 512-byte hdwr sectors (250059 MB)
  sdh: Write Protect is off
  sdh: Mode Sense: 00 3a 00 00
  SCSI device sdh: drive cache: write through
   sdh: sdh1 sdh2 < sdh5 sdh6 sdh7 sdh8 sdh9 sdh10 >
  sd 5:0:0:0: Attached scsi disk sdh

Steps to reproduce:
  boot 2.6.18.2 kernel,
  hdparm -W0 /dev/sdg

  Works fine on 2.6.17.x
Comment 1 martin f. krafft 2006-11-10 06:36:30 UTC
Created attachment 9451 [details]
lspci -vv output
Comment 2 martin f. krafft 2006-11-10 06:57:50 UTC
Also see http://bugs.debian.org/391929 -- unfortunately I found no way to add
391929-quiet@bugs.debian.org to the CC list of this bug.
Comment 3 martin f. krafft 2007-01-10 12:21:04 UTC
I cannot seem to reproduce this on 2.6.20-rc4.
Comment 4 dann frazier 2007-01-10 17:52:41 UTC
Any idea what changeset(s) fixed it?
Comment 5 martin f. krafft 2007-01-11 02:03:51 UTC
Tejun Heo thinks it's this one, but he isn't sure:
http://article.gmane.org/gmane.linux.ide/14188
Comment 6 martin f. krafft 2007-01-11 02:05:24 UTC

*** This bug has been marked as a duplicate of 7412 ***
