Bug 6737

Summary: 2.6.17.1 shows dma_intr: status=0x58 for /dev/hda once on boot
Product: IO/Storage Reporter: Martin Steigerwald (Martin)
Component: IDEAssignee: Bartlomiej Zolnierkiewicz (bzolnier)
Status: CLOSED INSUFFICIENT_DATA    
Severity: normal CC: akpm, protasnb
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.17.1 Subsystem:
Regression: --- Bisected commit-id:
Attachments: output of lspci --vvn
configuration of kernel 2.6.17.1 I used (plain vanilla, without sws2)
configuration of the older kernel (2.6.15.7-sws-2.2) that did not show those dma_intr messages
configuration of more recent older kernel (2.6.16.11) that does not show dma_intr messages

Description Martin Steigerwald 2006-06-22 23:54:16 UTC
Most recent kernel where this bug did not occur: no later one tested
Distribution: Debian Etch/Sid
Hardware Environment: IBM ThinkPad T23
Software Environment: 
Problem Description:

Kernel 2.6.17.1 shows dma_intr: status=0x58 for /dev/hda once on boot. Tested 
with plain vanilla 2.6.17.1 as well as vanilla + sws2 2.2.5.3.

This is the exact log excerpt from /var/log/syslog:

----------------------------------------------------------
Jun 23 07:35:30 deepdance kernel: hda: dma_intr: status=0x58 { DriveReady 
SeekComplete DataRequest }
Jun 23 07:35:30 deepdance kernel: ide: failed opcode was: unknown
Jun 23 07:35:30 deepdance kernel: hda: set_drive_speed_status: status=0x58 { 
DriveReady SeekComplete DataRequest }
Jun 23 07:35:30 deepdance kernel: ide: failed opcode was: unknown
----------------------------------------------------------

Does that mean that my harddisk may fail soon? Or does kernel 2.6.17.1 try a 
command my controller or harddisk does not understand?


Diagnostic data follows:

smartctl seems to indicicate that all is well:

----------------------------------------------------------
root@deepdance:~ -> smartctl -a /dev/hda
smartctl version 5.36 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
[...]
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
[...]
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  
LBA_of_first_error
# 1  Short offline       Completed without error       00%      2950         -
# 2  Extended offline    Completed without error       00%      2944         -
# 3  Short offline       Completed without error       00%      2913         -
# 4  Extended offline    Completed without error       00%      2912         -
# 5  Short offline       Completed without error       00%      2903         -
# 6  Short offline       Completed without error       00%      2875         -
# 7  Short offline       Completed without error       00%      2852         -
# 8  Short offline       Completed without error       00%      2836         -
# 9  Short offline       Completed without error       00%      2808         -
#10  Short offline       Completed without error       00%      2801         -
#11  Short offline       Completed without error       00%      2766         -
#12  Extended offline    Completed without error       00%      2754         -
#13  Short offline       Completed without error       00%      2741         -
#14  Short offline       Completed without error       00%      2717         -
#15  Short offline       Completed without error       00%      2705         -
#16  Short offline       Completed without error       00%      2678         -
#17  Short offline       Completed without error       00%      2659         -
#18  Short offline       Completed without error       00%      2658         -
#19  Short offline       Completed without error       00%      2639         -
#20  Short offline       Completed without error       00%      2634         -
#21  Short offline       Completed without error       00%      2618         -
----------------------------------------------------------

It shows some rather old errors that have been logged about a year or so - I 
decided long ago that these would not be critical, but I am not completely 
sure:

----------------------------------------------------------
Error 200 occurred at disk power-on lifetime: 393 hours (16 days + 9 hours)
  When the command that caused the error occurred, the device was active or 
idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  10 59 01 73 65 6c ee  Error: IDNF at LBA = 0x0e6c6573 = 241984883

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  20 03 01 73 65 6c ee 00      00:00:50.900  READ SECTOR(S)
  c8 03 01 73 65 6c ee 00      00:00:50.900  READ DMA
  20 03 01 73 65 6c ee 00      00:00:50.800  READ SECTOR(S)
  c8 03 01 73 65 6c ee 00      00:00:50.800  READ DMA
  20 03 01 73 65 6c ee 00      00:00:50.700  READ SECTOR(S)
----------------------------------------------------------

This is my harddisk (an IBM / Hitachi one):

Kernel command line: root=/dev/hda5 ro resume2=swap:/dev/hda7 resume=/dev/hda7
    ide0: BM-DMA at 0x1860-0x1867, BIOS settings: hda:DMA, hdb:pio
hda: HTS548060M9AT00, ATA DISK drive
hda: max request size: 512KiB
hda: 117210240 sectors (60011 MB) w/7877KiB Cache, CHS=16383/255/63, UDMA(100)
hda: cache flushes supported
 hda: hda1 hda2 hda3 hda4 < hda5 hda6 hda7 hda8 >


It has a quite high Load_Cycle_Count, since long time I used a sub optimal apm 
value in hdparm configuration:

----------------------------------------------------------
root@deepdance:~ -> smartctl -a /dev/hda | grep 'Load_Cycle_Count' | 
awk '{print $10}'
175309
----------------------------------------------------------

(see http://blog.philkern.de/archives/173-Load-Cycle-Fix-for-Toshiba-HDDs.html)


This is my hdparm configuration (/etc/hdparm.conf):

----------------------------------------------------------
/dev/hda {
       mult_sect_io = 16
       write_cache = off
       dma = on
       apm = 0
       acoustic_management = 128
       io32_support = 3
       keep_settings_over_reset = on
       interrupt_unmask = on
}
----------------------------------------------------------

I have write cache still off due issues with XFS and disabled write cache (see 
bug #6380). I want to try 2.6.17 which should have write barriers enabled by 
default for XFS some time with write cache disabled and then enable them to see 
whether it works stable.

This is the hardware:

----------------------------------------------------------
root@deepdance:~ -> lspci
0000:00:00.0 Host bridge: Intel Corporation 82830 830 Chipset Host Bridge (rev 
04)
0000:00:01.0 PCI bridge: Intel Corporation 82830 830 Chipset AGP Bridge (rev 
04)
0000:00:1d.0 USB Controller: Intel Corporation 82801CA/CAM USB (Hub #1) (rev 
02)
0000:00:1d.1 USB Controller: Intel Corporation 82801CA/CAM USB (Hub #2) (rev 
02)
0000:00:1d.2 USB Controller: Intel Corporation 82801CA/CAM USB (Hub #3) (rev 
02)
0000:00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev 42)
0000:00:1f.0 ISA bridge: Intel Corporation 82801CAM ISA Bridge (LPC) (rev 02)
0000:00:1f.1 IDE interface: Intel Corporation 82801CAM IDE U100 (rev 02)
0000:00:1f.3 SMBus: Intel Corporation 82801CA/CAM SMBus Controller (rev 02)
0000:00:1f.5 Multimedia audio controller: Intel Corporation 82801CA/CAM AC'97 
Audio Controller (rev 02)
0000:01:00.0 VGA compatible controller: S3 Inc. SuperSavage IX/C SDR (rev 05)
0000:02:00.0 CardBus bridge: Texas Instruments PCI1420
0000:02:00.1 CardBus bridge: Texas Instruments PCI1420
0000:02:02.0 Communication controller: Agere Systems WinModem 56k (rev 01)
0000:02:08.0 Ethernet controller: Intel Corporation 82801CAM (ICH3) PRO/100 VE 
(LOM) Ethernet Controller (rev 42)
0000:03:00.0 USB Controller: NEC Corporation USB (rev 43)
0000:03:00.1 USB Controller: NEC Corporation USB (rev 43)
0000:03:00.2 USB Controller: NEC Corporation USB 2.0 (rev 04)
----------------------------------------------------------

I attach output of lspci -vvn to this report.


Steps to reproduce:

Honestly I have no idea. I do not what triggers this behavior.
Comment 1 Martin Steigerwald 2006-06-22 23:58:10 UTC
Created attachment 8387 [details]
output of lspci --vvn
Comment 2 Martin Steigerwald 2006-06-22 23:59:24 UTC
Created attachment 8388 [details]
configuration of kernel 2.6.17.1 I used (plain vanilla, without sws2)
Comment 3 Andrew Morton 2006-06-23 00:03:13 UTC
On Thu, 22 Jun 2006 23:55:17 -0700
bugme-daemon@bugzilla.kernel.org wrote:

> Most recent kernel where this bug did not occur: no later one tested

You misunderstand.  This question is asking whether an earlier kernel did
_not_ have this problem.  If so, what was that earlier version?

Comment 4 Martin Steigerwald 2006-06-23 00:23:38 UTC
Thanks, Andrew. I wanted to add this information as I recognized I did not 
include it and got a mid air collision ;-) with your comment.

I used 2.6.15.7 with sws2-2.2 before. I grepped through older logs and did not 
find any occurences of this error:

root@deepdance:/var/log -> zgrep "hda: dma_intr: status=0x58" syslog*
syslog:Jun 23 03:42:26 deepdance kernel: hda: dma_intr: status=0x58 { 
DriveReady SeekComplete DataRequest }
syslog:Jun 23 07:35:30 deepdance kernel: hda: dma_intr: status=0x58 { 
DriveReady SeekComplete DataRequest }
syslog.1.gz:Jun 23 01:44:26 deepdance kernel: hda: dma_intr: status=0x58 { 
DriveReady SeekComplete DataRequest }

These are all with the new kernel.

I attach the kernel configuration of that older kernel as well.

Comment 5 Martin Steigerwald 2006-06-23 00:33:25 UTC
Created attachment 8391 [details]
configuration of the older kernel (2.6.15.7-sws-2.2) that did not show those dma_intr messages


A more generic grep did not show any different results:

root@deepdance:/var/log -> zgrep "hda: dma_intr" syslog*
syslog:Jun 23 03:42:26 deepdance kernel: hda: dma_intr: status=0x58 {
DriveReady SeekComplete DataRequest }
syslog:Jun 23 07:35:30 deepdance kernel: hda: dma_intr: status=0x58 {
DriveReady SeekComplete DataRequest }
syslog.1.gz:Jun 23 01:44:26 deepdance kernel: hda: dma_intr: status=0x58 {
DriveReady SeekComplete DataRequest }


I am using 2.6.17.1 now with sws2 again cause it seems that the new sws1 from
2.6.17 seems to need that additional user space stuff in order to work - I
thought that it would work without it. At least it didnt work here - empty
screen with blinking cursor and nothing else. I decided for now to use what
works for me instead of trying some supposedly experimental userspace sws.

sws2 seems to work fine. But anyway I got this dma_intr message with 2.6.17.1
without sws2 as well (syslog.1.gz grep result).
Comment 6 Martin Steigerwald 2006-06-23 12:50:51 UTC
Now I also tested with the 2.6.16.11 the latest one of 2.6.16 that I used for a 
short time. This kernel also does not show the dma_intr messages at boot... 
thus this behaviour must have been introduced somewhere between 2.6.16.11 and 
2.6.17.1.

root@deepdance:~ -> cat /proc/version
Linux version 2.6.16.11-tp23-sws2-2.2.5 (root@deepdance) (gcc-Version 4.0.3 
(Debian 4.0.3-1)) #1 PREEMPT Fri Apr 28 13:17:58 CEST 2006

I reverted back to 2.6.15 cause sound didn't work after resume from sws 2 - it 
works nicely with 2.6.17 again ;-).

If needed I can try with 2.6.16.21 or whatever is latest 2.6.16 as well.

I attach kernel config of this one as well.
Comment 7 Martin Steigerwald 2006-06-23 12:52:20 UTC
Created attachment 8403 [details]
configuration of more recent older kernel (2.6.16.11) that does not show dma_intr messages

I leave all of these installed in case they are required for further testing.
Comment 8 James Slater 2006-06-27 05:06:59 UTC
I am experiencing (almost) the same issue. I have tested 2.6.16.20 and 2.6.17, 
the problem occurs in both.

The issue occurs when hdparm is used (-X68 -d1 -c1 -u1) while disk access is 
taking place. I've only noticed it recently due to recent changes in Gentoo 
which have enabled parallel startup of hdparm with other services. If I run 
hdparm manually, with no disk access going on, there is no problem. When it is 
run as part of the boot process the following appear in dmesg:

hda: set_drive_speed_status: status=0x58 { DriveReady SeekComplete 
DataRequest }
ReiserFS: warning: is_leaf: free space seems wrong: level=1, nr_items=29, 
free_s
pace=100 rdkey
ReiserFS: hda2: warning: vs-5150: search_by_key: invalid format found in block 
3
3810. Fsck?
ReiserFS: hda2: warning: vs-13070: reiserfs_read_locked_inode: i/o failure 
occur
red trying to find stat data of [2 709178 0x0 SD]
ReiserFS: warning: is_leaf: free space seems wrong: level=1, nr_items=29, 
free_s
pace=100 rdkey
ReiserFS: hda2: warning: vs-5150: search_by_key: invalid format found in block 
3
3810. Fsck?
ReiserFS: hda2: warning: vs-13070: reiserfs_read_locked_inode: i/o failure 
occur
red trying to find stat data of [2 709178 0x0 SD]

I assume the ReiserFS problems are related, they do not occur otherwise. 
Unfortunately often this leaves my system unable to boot with various errors 
(files missing, segmentation faults, permission problems).


My hardware is completely different (a desktop machine):

00:00.0 Host bridge: VIA Technologies, Inc. VT8366/A/7 [Apollo KT266/A/333]
00:01.0 PCI bridge: VIA Technologies, Inc. VT8366/A/7 [Apollo KT266/A/333 AGP]
00:0a.0 Ethernet controller: Davicom Semiconductor, Inc. 21x4x DEC-Tulip 
compatible 10/100 Ethernet (rev 31)
00:0c.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev 07)
00:0c.1 Input device controller: Creative Labs SB Live! MIDI/Game Port (rev 07)
00:0d.0 Network controller: Broadcom Corporation BCM4306 802.11b/g Wireless LAN 
Controller (rev 03)
00:0e.0 Multimedia audio controller: Ensoniq 5880 AudioPCI (rev 02)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8233 PCI to ISA Bridge
00:11.1 IDE interface: VIA Technologies, Inc. 
VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
00:11.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 
Controller (rev 1b)
00:11.3 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 
Controller (rev 1b)
00:11.4 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 
Controller (rev 1b)
01:00.0 VGA compatible controller: nVidia Corporation NV11 [GeForce2 MX/MX 400] 
(rev a1)
Comment 9 Martin Steigerwald 2006-06-27 15:14:58 UTC
Hmmm, seems that I am also getting slight corruption here. With XFS.

See kernel bug #6757
Comment 10 Martin Steigerwald 2006-06-30 13:27:30 UTC
The slight corruption I had with XFS had a different cause. See that bugreport 
(bug #6757).
Comment 11 Natalie Protasevich 2007-07-08 12:42:11 UTC
Martin,
Any news on this problem? Is it still there with latest kernel?
If so, can you please do git bisect or binary search so you can identify the patch that broke it. It might take a while though.
Thanks.
Comment 12 Bartlomiej Zolnierkiewicz 2008-02-16 10:48:33 UTC
I'm closing this one, please re-open if it still happens with the recent kernels.
Comment 13 Martin Steigerwald 2008-02-16 11:44:17 UTC
Hmmm, this is ancient. I already forgot about this one.

I can't say much about it at the moment, since I am using libata since quite some time.

A 

zgrep "dma_intr" syslog*

does not yield any results on my T23 so I think at least libata doesn't have this issue ;).

So for me that issue is solved even when that is with a different driver architecture, closing bug. If anyone still has this issue with IDE drivers he/she can reopen it.
Comment 14 Martin Steigerwald 2008-02-16 11:45:58 UTC
One additional info: That zgrep results relate to 2.6.24 and 2.6.23 kernels, I think anything prior to that is logrotated to nil already.
Comment 15 Bartlomiej Zolnierkiewicz 2008-02-16 11:47:16 UTC
OK, thanks for info.