Bug 11328 - kernel 2.6.26 reports massive filesystem errors on RAID5 device; on 2.6.24 it is working but not well
Summary: kernel 2.6.26 reports massive filesystem errors on RAID5 device; on 2.6.24 i...
Status: RESOLVED OBSOLETE
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: All Linux
Importance: P1 normal
Assignee: Tejun Heo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-08-14 01:02 UTC by F. P. Beekhof
Modified: 2012-10-30 15:01 UTC
CC List: 6 users

See Also:
Kernel Version: 2.6.24 / 2.6.26
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Dmesg of 2.6.24 (27.49 KB, text/plain) - 2008-08-14 01:05 UTC, F. P. Beekhof
dmesg of 2.6.26 (34.61 KB, text/plain) - 2008-08-14 01:06 UTC, F. P. Beekhof
list of PCI devices; obtained under 2.6.26 (7.83 KB, text/plain) - 2008-08-14 01:06 UTC, F. P. Beekhof
kernel log file 2.6.26 (75.09 KB, text/plain) - 2008-08-14 09:15 UTC, F. P. Beekhof
the syslog, contains other error msg's (82.73 KB, text/plain) - 2008-08-14 09:28 UTC, F. P. Beekhof
output of e2fsck, showing errors that are not there using 2.6.24 (4.51 KB, text/plain) - 2008-08-14 09:37 UTC, F. P. Beekhof
smartctl -a /dev/sdb obtained under 2.6.26 (9.91 KB, text/plain) - 2008-08-20 11:56 UTC, F. P. Beekhof
smartctl -a /dev/sdc ; other disk, same age/type (9.88 KB, text/plain) - 2008-08-20 12:09 UTC, F. P. Beekhof
smartctl -a /dev/sda ; This is a newer disk (5.45 KB, text/plain) - 2008-08-20 12:16 UTC, F. P. Beekhof
new dmesg of kernel 2.6.27 (NOT 26!) after resyncing the array under 2.6.24 (38.61 KB, text/plain) - 2008-08-29 15:13 UTC, F. P. Beekhof
Output of lspci on BROKEN (18.94 KB, text/plain) - 2008-09-09 10:23 UTC, F. P. Beekhof
Output of lspci on WORKING (39.03 KB, text/plain) - 2008-09-09 10:31 UTC, F. P. Beekhof
Output of lspci on BROKEN; this time _with_ root permissions (38.99 KB, text/plain) - 2008-09-09 10:40 UTC, F. P. Beekhof
After this BIOS config change; 2.6.27 recognizes only 4 / 7 drives. (38.99 KB, text/plain) - 2008-09-09 11:12 UTC, F. P. Beekhof
New dmesg (28.69 KB, text/plain) - 2008-09-14 05:56 UTC, F. P. Beekhof
new dmesg of 2.6.27 (39.36 KB, text/plain) - 2008-09-14 06:12 UTC, F. P. Beekhof
dmesg of 2.6.27 with libata.dma=0 (39.65 KB, text/plain) - 2008-09-14 06:33 UTC, F. P. Beekhof
dmesg of 2.6.27 with PIO4 mode (38.29 KB, text/plain) - 2008-09-14 09:27 UTC, F. P. Beekhof
lspci (34.61 KB, application/octet-stream) - 2011-01-31 06:09 UTC, Ralf-Peter Rohbeck
dmesg (51.18 KB, application/octet-stream) - 2011-01-31 06:15 UTC, Ralf-Peter Rohbeck
dmidecode (16.98 KB, application/octet-stream) - 2011-01-31 06:17 UTC, Ralf-Peter Rohbeck
lvm info (548 bytes, application/octet-stream) - 2011-01-31 06:18 UTC, Ralf-Peter Rohbeck
fstab (1.03 KB, application/octet-stream) - 2011-01-31 06:18 UTC, Ralf-Peter Rohbeck

Description F. P. Beekhof 2008-08-14 01:02:49 UTC
I'm running Ubuntu Hardy on an AMD64 machine with a RAID5 array in it, ext3
formatted. It has been working fine for years, with the exception of
Gutsy's kernel, which couldn't handle that many IDE devices (bug 157909,
fixed in Hardy).

To make backups, I bought an external drive and an eSATA controller. The
controller is not supported in Hardy's kernel (2.6.24), but it is in
Intrepid, so I installed linux-image-2.6.26-5-generic from Intrepid.

Running this kernel causes several changes:
1) My new controller is recognized
2) All disks are now known as /dev/sd* and not /dev/hd*
3) Copying from /dev/md0 results in many filesystem errors
4) e2fsck reports massive errors on the filesystem on /dev/md0

If I reboot into 2.6.24, all problems are gone and everything is back to
normal. This implies that there is a major problem somewhere! I happen
to mount the device read-only, but mounting it rw will no doubt result
in severe damage to the fs. I accidentally tested this by running e2fsck
under 2.6.26, which did introduce real errors that I then had to fix
under 2.6.24. Now, under 2.6.24, all is fine and 2.6.26 still reports
many errors.

Since then, Tejun Heo pointed out that things under 2.6.24 were not that good either: all disks on my HPT controller were working in PIO4 (but at least without errors), and he advised opening a bug report.

See attachments for details.
Comment 1 F. P. Beekhof 2008-08-14 01:05:51 UTC
Created attachment 17229 [details]
Dmesg of 2.6.24
Comment 2 F. P. Beekhof 2008-08-14 01:06:17 UTC
Created attachment 17230 [details]
dmesg of 2.6.26
Comment 3 F. P. Beekhof 2008-08-14 01:06:58 UTC
Created attachment 17231 [details]
list of PCI devices; obtained under 2.6.26
Comment 4 Bartlomiej Zolnierkiewicz 2008-08-14 01:37:35 UTC
According to dmesg-2.6.24, all disks on the HPT controller are using UDMA/100 and there are no error messages about dropping to PIO.

....
[   43.835503] hde: Maxtor 7L300R0, ATA DISK drive
[   43.835536] hde: host max PIO4 wanted PIO255(auto-tune) selected PIO4
[   43.837037] hde: UDMA/100 mode selected
[   43.838590] ide2 at 0x7000-0x7007,0x6802 on irq 16
[   43.838626] Probing IDE interface ide3...
[   44.127439] hdh: Maxtor 5A300J0, ATA DISK drive
[   44.463041] hdg: Maxtor 5A300J0, ATA DISK drive
[   44.463062] hdg: host max PIO4 wanted PIO255(auto-tune) selected PIO4
[   44.463201] hdg: UDMA/100 mode selected
[   44.463340] hdh: host max PIO4 wanted PIO255(auto-tune) selected PIO4
[   44.463472] hdh: UDMA/100 mode selected
[   44.463638] ide3 at 0x6400-0x6407,0x6002 on irq 16
[   44.463692] Probing IDE interface ide0...
[   44.750830] hdb: Maxtor 5A300J0, ATA DISK drive
[   45.086698] hda: Maxtor 5A300J0, ATA DISK drive
[   45.086721] hda: host max PIO4 wanted PIO255(auto-tune) selected PIO4
[   45.086858] hda: UDMA/100 mode selected
[   45.086997] hdb: host max PIO4 wanted PIO255(auto-tune) selected PIO4
[   45.087126] hdb: UDMA/100 mode selected
[   45.087295] ide0 at 0x8800-0x8807,0x8402 on irq 16
[   45.087351] Probing IDE interface ide1...
[   45.374370] hdd: Maxtor 5A300J0, ATA DISK drive
[   45.710125] hdc: Maxtor 5A300J0, ATA DISK drive
[   45.710147] hdc: host max PIO4 wanted PIO255(auto-tune) selected PIO4
[   45.710293] hdc: UDMA/100 mode selected
[   45.710432] hdd: host max PIO4 wanted PIO255(auto-tune) selected PIO4
[   45.710564] hdd: UDMA/100 mode selected
[   45.710728] ide1 at 0x8000-0x8007,0x7802 on irq 16
...

[ You can verify if DMA is really used by checking DMA enable setting with
  'hdparm -d /dev/device' and device speed with 'hdparm -Ttf /dev/device'. ]
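A quick way to run that over every drive on the controller (a sketch; device names taken from the dmesg above):

  # check the DMA setting and raw read speed of every IDE disk
  for d in /dev/hda /dev/hdb /dev/hdc /dev/hdd /dev/hde /dev/hdg /dev/hdh; do
      echo "== $d =="
      hdparm -d "$d"      # "using_dma = 1" means DMA is really enabled
      hdparm -Ttf "$d"    # cached and buffered read timings
  done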

HPT support in libata (SCSI emulation) is known to be experimental, so I advise using the IDE subsystem.  I'll re-assign the bug to the right people anyway.
Comment 5 Alan 2008-08-14 03:37:57 UTC
Please attach logs of the actual errors you see. Without that it is hard to do anything.

Testing the legacy HPT driver would also be useful, as Bartlomiej suggests. We've seen a lot of 2.6.25/6 IRQ routing problems that cause disk trouble, although your bug sounds a bit different.
Comment 6 F. P. Beekhof 2008-08-14 05:24:44 UTC
Do you mean that I should test with the legacy HPT driver in 2.6.26? 2.6.24 did not show any problems and named the disks /dev/hd*, so I think it uses the legacy driver. (I could be wrong.)

I don't have access to my machine until this evening, but the errors were all like these examples:
> Mar 20 01:19:50 server kernel: attempt to access beyond end of device
> Mar 20 01:19:50 server kernel: 03:41: rw=0, want=950783358, limit=11773912
( example from: http://www.patoche.org/LTT/disk/00000163.html )

Another symptom was the fact that e2fsck reported many different errors under 2.6.26, but under 2.6.24 it was fine. 

In short, it looks like reading produces corrupted data without specifically causing error messages; the "attempt to access beyond end of device" is probably a consequence of misread metadata in the file system.
Comment 7 F. P. Beekhof 2008-08-14 09:15:04 UTC
Created attachment 17243 [details]
kernel log file 2.6.26

contains (plenty of) error messages
Comment 8 F. P. Beekhof 2008-08-14 09:28:19 UTC
Created attachment 17244 [details]
the syslog, contains other error msg's

Here's something in syslog that I didn't see in dmesg / kern.log:

Aug 14 18:03:44 DeathStar kernel: [ 1434.849462] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Aug 14 18:03:44 DeathStar kernel: [ 1434.849470] ata2.00: BMDMA stat 0x4
Aug 14 18:03:44 DeathStar kernel: [ 1434.849476] ata2.00: cmd 25/00:70:47:0f:74/00:00:1f:00:00/e0 tag 0 dma 57344 in
Aug 14 18:03:44 DeathStar kernel: [ 1434.849477]          res 51/40:00:4b:0f:74/40:00:1f:00:00/e0 Emask 0x9 (media error)
Aug 14 18:03:44 DeathStar kernel: [ 1434.849480] ata2.00: status: { DRDY ERR }
Aug 14 18:03:44 DeathStar kernel: [ 1434.849482] ata2.00: error: { UNC }
Aug 14 18:03:44 DeathStar kernel: [ 1434.856534] Find mode for 12 reports A81F442
Aug 14 18:03:44 DeathStar kernel: [ 1434.856544] Find mode for 12 reports A81F442
Aug 14 18:03:44 DeathStar kernel: [ 1434.856549] Find mode for DMA 69 reports 12848242
Aug 14 18:03:44 DeathStar kernel: [ 1434.856554] Find mode for DMA 69 reports 12848242
Aug 14 18:03:44 DeathStar kernel: [ 1434.872532] ata2.00: configured for UDMA/100
Aug 14 18:03:44 DeathStar kernel: [ 1434.888532] ata2.01: configured for UDMA/100
Aug 14 18:03:44 DeathStar kernel: [ 1434.888551] ata2: EH complete
Aug 14 18:03:46 DeathStar kernel: [ 1436.864772] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Aug 14 18:03:46 DeathStar kernel: [ 1436.864780] ata2.00: BMDMA stat 0x4
Aug 14 18:03:46 DeathStar kernel: [ 1436.864786] ata2.00: cmd 25/00:70:47:0f:74/00:00:1f:00:00/e0 tag 0 dma 57344 in
Aug 14 18:03:46 DeathStar kernel: [ 1436.864787]          res 51/40:00:4b:0f:74/40:00:1f:00:00/e0 Emask 0x9 (media error)
Aug 14 18:03:46 DeathStar kernel: [ 1436.864790] ata2.00: status: { DRDY ERR }
Aug 14 18:03:46 DeathStar kernel: [ 1436.864792] ata2.00: error: { UNC }
Aug 14 18:03:46 DeathStar kernel: [ 1436.872544] Find mode for 12 reports A81F442
Aug 14 18:03:46 DeathStar kernel: [ 1436.872554] Find mode for 12 reports A81F442
Aug 14 18:03:46 DeathStar kernel: [ 1436.872560] Find mode for DMA 69 reports 12848242
Aug 14 18:03:46 DeathStar kernel: [ 1436.872565] Find mode for DMA 69 reports 12848242
Aug 14 18:03:46 DeathStar kernel: [ 1436.888520] ata2.00: configured for UDMA/100
Aug 14 18:03:46 DeathStar kernel: [ 1436.904499] ata2.01: configured for UDMA/100
Aug 14 18:03:46 DeathStar kernel: [ 1436.904518] ata2: EH complete
Aug 14 18:03:46 DeathStar kernel: [ 1437.659807] sd 1:0:0:0: [sdb] 585940320 512-byte hardware sectors (300001 MB)
Aug 14 18:03:46 DeathStar kernel: [ 1437.659833] sd 1:0:0:0: [sdb] Write Protect is off
Aug 14 18:03:46 DeathStar kernel: [ 1437.659836] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
Aug 14 18:03:46 DeathStar kernel: [ 1437.659871] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Comment 9 F. P. Beekhof 2008-08-14 09:37:22 UTC
Created attachment 17245 [details]
output of e2fsck, showing errors that are not there using 2.6.24
Comment 10 Tejun Heo 2008-08-20 00:54:09 UTC
The drive is reporting a media error, and the fs on it somehow thinks it's much larger than md thinks it is.  Well, the media error part is most likely just a bad disk, but the fs getting corrupted is worrisome.  It could be that the faulting drive returned corrupt data.  Can you please report the 'smartctl -a' output of /dev/sdb?  And you'd better run a full SMART check on the drive and rebuild the array.
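A full SMART check can be kicked off with something like this (a sketch using smartmontools; /dev/sdb as above):

  smartctl -t long /dev/sdb        # start the extended self-test (runs in the background on the drive)
  smartctl -l selftest /dev/sdb    # check the result once the test has finished
  smartctl -a /dev/sdb             # full attribute and error-log dump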
Comment 11 F. P. Beekhof 2008-08-20 11:56:05 UTC
Created attachment 17340 [details]
smartctl -a /dev/sdb obtained under 2.6.26
Comment 12 F. P. Beekhof 2008-08-20 12:09:16 UTC
Created attachment 17341 [details]
smartctl -a /dev/sdc ; other disk, same age/type

Added just so you can compare with /dev/sdb
Actually, I'm not an expert on this subject, but I had a quick look at all the disks and could not find one that stood out from the others in any sense. Then again, it's quite possible that I overlooked something.
Comment 13 F. P. Beekhof 2008-08-20 12:16:16 UTC
Created attachment 17342 [details]
smartctl -a /dev/sda ; This is a newer disk

BTW, does ata2.00 correspond to /dev/sdb? How do I check that?
Comment 14 F. P. Beekhof 2008-08-20 12:31:29 UTC
Tried to reproduce the problem again; this is easy under 2.6.26...
Aug 20 21:25:02 DeathStar /USR/SBIN/CRON[8848]: (root) CMD ([ -x /usr/lib/sysstat/sa1 ] && { [ -r "$DEFAULT" ] && . "$DEFAULT" ; [ "$ENABLED" = "true" ] && exec /usr/lib/sysstat/sa1 $SA1_OPTIONS 1 1 ; })
Aug 20 21:27:47 DeathStar kernel: [ 6098.519099] attempt to access beyond end of device
Aug 20 21:27:47 DeathStar kernel: [ 6098.519135] md0: rw=0, want=22642891208, limit=2929692160
Aug 20 21:27:47 DeathStar kernel: [ 6098.531869] attempt to access beyond end of device
Aug 20 21:27:47 DeathStar kernel: [ 6098.531916] md0: rw=0, want=22642891208, limit=2929692160
Aug 20 21:28:04 DeathStar kernel: [ 6115.075380] attempt to access beyond end of device
Aug 20 21:28:04 DeathStar kernel: [ 6115.075390] md0: rw=0, want=33383038176, limit=2929692160
Aug 20 21:28:04 DeathStar kernel: [ 6115.088949] attempt to access beyond end of device
Aug 20 21:28:04 DeathStar kernel: [ 6115.089208] md0: rw=0, want=33383038176, limit=2929692160
Aug 20 21:28:48 DeathStar kernel: [ 6159.417935] attempt to access beyond end of device
Aug 20 21:28:48 DeathStar kernel: [ 6159.417943] md0: rw=0, want=14739476584, limit=2929692160
Aug 20 21:28:48 DeathStar kernel: [ 6159.432137] attempt to access beyond end of device
Aug 20 21:28:48 DeathStar kernel: [ 6159.432144] md0: rw=0, want=14739476584, limit=2929692160
Aug 20 21:28:50 DeathStar kernel: [ 6161.245244] attempt to access beyond end of device
Aug 20 21:28:50 DeathStar kernel: [ 6161.245251] md0: rw=0, want=27074384712, limit=2929692160
Aug 20 21:28:50 DeathStar kernel: [ 6161.253013] attempt to access beyond end of device
Aug 20 21:28:50 DeathStar kernel: [ 6161.253020] md0: rw=0, want=27074384712, limit=2929692160
Aug 20 21:28:54 DeathStar kernel: [ 6165.172581] attempt to access beyond end of device
Aug 20 21:28:54 DeathStar kernel: [ 6165.172588] md0: rw=0, want=24797184392, limit=2929692160
Aug 20 21:28:54 DeathStar kernel: [ 6165.181799] attempt to access beyond end of device
Aug 20 21:28:54 DeathStar kernel: [ 6165.181806] md0: rw=0, want=24797184392, limit=2929692160
Comment 15 Tejun Heo 2008-08-20 20:11:55 UTC
Hmm... no stat stands out in the smartctl output but /dev/sdb is reporting more errors and more importantly UNC on READ (the drive can't read the data at the given LBA).  It looks like your filesystem is corrupt and certain files point way outside the actual device limit.  Can you please resync the array and run fsck on it and then see whether the problem goes away?
Comment 16 Tejun Heo 2008-08-20 20:23:06 UTC
Oh.. and /dev/sdX <-> ataN.DD mapping can be determined by reading the log.  It's a bit tedious and stupid, I have to admit. :-)

1. When ata gets probed it prints...

  scsi0 : pata_hpt37x
  scsi1 : pata_hpt37x
  ata1: PATA max UDMA/100 cmd 0x7000 ctl 0x6800 bmdma 0x5800 irq 16
  ata2: PATA max UDMA/100 cmd 0x6400 ctl 0x6000 bmdma 0x5808 irq 16

 Which indicates that pata_hpt37x allocated two SCSI hosts and mapped them to ata1 and 2, so ata1 -> scsi0, ata2-> scsi1.

2. Then later during SCSI probing...

  sd 0:0:0:0: [sda] 586114704 512-byte hardware sectors (300091 MB)
  sd 0:0:0:0: [sda] Write Protect is off
  ...
  sd 1:0:0:0: [sdb] 585940320 512-byte hardware sectors (300001 MB)
  ...
  sd 1:0:1:0: [sdc] 585940320 512-byte hardware sectors (300001 MB)


 Here, the four numbers represent host:channel:id:lun.  So, sdb lives on SCSI host 1 and its device ID is 0.  SCSI host 1 is ata2 and libata maps device number to SCSI device ID, so 1:0:0:0 is ata2.00 and 1:0:1:0 is ata2.01.  Heh.. this should really be much easier.  :-(
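Pulling the mapping out of the logs in one go might look like this (a sketch, assuming the boot messages are still in the dmesg buffer):

  dmesg | grep -E 'scsi[0-9]+ : '                          # which driver owns which SCSI host
  dmesg | grep -E 'ata[0-9]+: [PS]ATA max'                 # the ataN ports and their resources
  dmesg | grep -E 'sd [0-9]+:[0-9]+:[0-9]+:[0-9]+: \[sd'   # host:channel:id:lun -> sdX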
Comment 17 F. P. Beekhof 2008-08-21 02:20:26 UTC
Hmmm...

The errors in comment #14 seem consistent with what you describe in
comment #15.

I did run fsck on 2.6.24 (ubuntu uses legacy drivers here), and then
everything is fine.

On 2.6.26 (ubuntu switches to libata), fsck reports many errors, as does
any attempt to copy many files.
My hypothesis is that the observed errors with 2.6.26 are due to a
kernel bug in the ubuntu 2.6.26 kernel, and that if I make any
modification to the filesystem or the array (rebuild/fsck) under this
kernel, I'll actually damage my data.

"Fixing" the array and filesystem is most easily done by rebooting into
2.6.24 :-)
Just to make sure, I'll do that this evening and try to fsck and copy
files again to make sure that it all works flawless. However, I've done
this several times already and the results are consistent - 2.6.24: OK;
2.6.26: KO.

Actually, I already did run an fsck (partially) under 2.6.26, which
proceeded to correct many "errors" until I didn't trust what was going
on and aborted. Subsequently, I reran fsck under 2.6.24, which corrected
the errors introduced by 2.6.26's fsck. To verify, I ran
"fsck -fn /dev/md0" under both kernels afterwards: 2.6.24 reports all
fine; 2.6.26 reports many errors.


Comment 18 F. P. Beekhof 2008-08-21 16:00:08 UTC
# uname -a
Linux DeathStar 2.6.24-19-generic #1 SMP Fri Jul 11 21:01:46 UTC 2008 x86_64 GNU/Linux
# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 hdh1[0] hda1[6](S) hdb1[5] hde1[4] hdg1[3] hdc1[2] hdd1[1]
      1464846080 blocks level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]

unused devices: <none>
# mdadm -Q --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sat Dec 18 14:45:38 2004
     Raid Level : raid5
     Array Size : 1464846080 (1396.99 GiB 1500.00 GB)
  Used Dev Size : 292969216 (279.40 GiB 300.00 GB)
   Raid Devices : 6
  Total Devices : 7
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Aug 20 21:15:29 2008
          State : clean
 Active Devices : 6
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 310c9ec7:85f13ee4:cf831c97:1f936049
         Events : 0.761672

    Number   Major   Minor   RaidDevice State
       0      34       65        0      active sync   /dev/hdh1
       1      22       65        1      active sync   /dev/hdd1
       2      22        1        2      active sync   /dev/hdc1
       3      34        1        3      active sync   /dev/hdg1
       4      33        1        4      active sync   /dev/hde1
       5       3       65        5      active sync   /dev/hdb1

       6       3        1        -      spare   /dev/hda1
#  e2fsck -fvn /dev/md0
e2fsck 1.40.8 (13-Mar-2008)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

  153581 inodes used (0.08%)
   12326 non-contiguous inodes (8.0%)
         # of inodes with ind/dind/tind blocks: 122892/67765/12
366200089 blocks used (100.00%)
       0 bad blocks
      11 large files

  140161 regular files
   13409 directories
       0 character device files
       1 block device file
       0 fifos
       0 links
       0 symbolic links (0 fast symbolic links)
       1 socket
--------
  153572 files
#
Comment 19 Tejun Heo 2008-08-29 03:37:20 UTC
Hmm... As there's redundancy in the array, the difference might be caused by which disk the data is being read from.  Can you please try to resync the array from 2.6.24 and see whether the problem persists on more recent kernels?
Comment 20 F. P. Beekhof 2008-08-29 04:09:41 UTC
Ubuntu maintainers asked me to try out their latest 2.6.27 kernel; the
problem is still there:
https://bugs.launchpad.net/bugs/256316

OK, I'll try to do that resync over the weekend. It will take some time;
I'll have to study the subject matter, and the operations required are
rather time-consuming too.

1) Is there a (set of) command(s) that lets me do that "resync"? Or
should I just Read The Fine Manual of mdadm?
2) Does such an action destroy the filesystem present?
3) If so, I might want to switch from ext3 to xfs or reiserfs. Would there
be any objection to that, in that it might hamper the analysis?
Comment 21 Tejun Heo 2008-08-29 04:14:52 UTC
1) I think the command is "mdadm --update=resync /dev/md0", but I'm not too sure.
2) Usually not, but it would be wise to back up under the circumstances.
3) I think it's best to stick with ext3 for now to eliminate other variables but you're free to choose what to put on your array.  :-)
Comment 22 F. P. Beekhof 2008-08-29 15:11:11 UTC
OK, resync done (under 2.6.24).
The correct procedure is "mdadm --stop /dev/md0; mdadm -A --update=resync /dev/md0".
This does not destroy the fs, apparently.
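Spelled out, the sequence is roughly the following (a sketch; /dev/md0 as in this report, run from 2.6.24 with the array unmounted; it relies on /etc/mdadm/mdadm.conf naming the member disks):

  umount /dev/md0                      # nothing may be using the array
  mdadm --stop /dev/md0                # stop the array
  mdadm -A --update=resync /dev/md0    # re-assemble and force a resync
  cat /proc/mdstat                     # watch the resync progress
  mdadm -Q --detail /dev/md0           # confirm the array is clean afterwards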

Results:
$ cat mdadm-2.6.24-after-resync.txt
/dev/md0:
        Version : 00.90.03
  Creation Time : Sat Dec 18 14:45:38 2004
     Raid Level : raid5
     Array Size : 1464846080 (1396.99 GiB 1500.00 GB)
  Used Dev Size : 292969216 (279.40 GiB 300.00 GB)
   Raid Devices : 6
  Total Devices : 7
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Fri Aug 29 23:16:53 2008
          State : active
 Active Devices : 6
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 310c9ec7:85f13ee4:cf831c97:1f936049
         Events : 0.761677

    Number   Major   Minor   RaidDevice State
       0      34       65        0      active sync   /dev/hdh1
       1      22       65        1      active sync   /dev/hdd1
       2      22        1        2      active sync   /dev/hdc1
       3      34        1        3      active sync   /dev/hdg1
       4      33        1        4      active sync   /dev/hde1
       5       3       65        5      active sync   /dev/hdb1

       6       3        1        -      spare   /dev/hda1

Under 2.6.27 (NOT 26!) it looks like:
$ cat mdadm-2.6.27-after-resync.txt
/dev/md0:
        Version : 00.90.03
  Creation Time : Sat Dec 18 14:45:38 2004
     Raid Level : raid5
     Array Size : 1464846080 (1396.99 GiB 1500.00 GB)
  Used Dev Size : 292969216 (279.40 GiB 300.00 GB)
   Raid Devices : 6
  Total Devices : 7
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Fri Aug 29 23:23:11 2008
          State : clean
 Active Devices : 6
Working Devices : 7
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 310c9ec7:85f13ee4:cf831c97:1f936049
         Events : 0.761678

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       97        1      active sync   /dev/sdg1
       2       8       81        2      active sync   /dev/sdf1
       3       8       17        3      active sync   /dev/sdb1
       4       8        1        4      active sync   /dev/sda1
       5       8       65        5      active sync   /dev/sde1

       6       8       49        -      spare   /dev/sdd1

I'll attach the dmesg of 2.6.27; it looks nicer ...

... but still:
$ cat */*.avi > /dev/null
cat: About Schmidt/About Schmidt (1of2).avi: Input/output error
cat: All Quiet On The Western Front/all.quiet.on.the.western.front.1930.cd1.avi: Input/output error
cat: Apocalypse Now/Apocalypse_Now_01.avi: Input/output error

These are different files than the ones I reported earlier today, see:
https://bugs.launchpad.net/bugs/256316
so it would appear the errors occur at random.

From syslog:
Aug 29 23:52:58 DeathStar kernel: [ 1751.068013] attempt to access beyond end of device
Aug 29 23:52:58 DeathStar kernel: [ 1751.068013] md0: rw=0, want=21544843320, limit=2929692160
Aug 29 23:52:58 DeathStar kernel: [ 1751.076014] attempt to access beyond end of device
Aug 29 23:52:58 DeathStar kernel: [ 1751.076014] md0: rw=0, want=21544843320, limit=2929692160
Aug 29 23:55:02 DeathStar /USR/SBIN/CRON[7533]: (root) CMD ([ -x /usr/lib/sysstat/sa1 ] && { [ -r "$DEFAULT" ] && . "$DEFAULT" ; [ "$ENABLED" = "true" ] && exec /usr/lib/sysstat/sa1 $SA1_OPTIONS 1 1 ; })
Aug 29 23:55:07 DeathStar kernel: [ 1880.443924] attempt to access beyond end of device
Aug 29 23:55:07 DeathStar kernel: [ 1880.444013] md0: rw=0, want=8109118448, limit=2929692160
Aug 29 23:55:07 DeathStar kernel: [ 1880.452022] attempt to access beyond end of device
Aug 29 23:55:07 DeathStar kernel: [ 1880.452022] md0: rw=0, want=8109118448, limit=2929692160
Aug 29 23:57:04 DeathStar kernel: [ 1996.980014] attempt to access beyond end of device
Aug 29 23:57:04 DeathStar kernel: [ 1996.980014] md0: rw=0, want=8098573088, limit=2929692160
Aug 29 23:57:04 DeathStar kernel: [ 1996.988020] attempt to access beyond end of device
Aug 29 23:57:04 DeathStar kernel: [ 1996.988020] md0: rw=0, want=8098573088, limit=2929692160
Comment 23 F. P. Beekhof 2008-08-29 15:13:02 UTC
Created attachment 17535 [details]
new dmesg of kernel 2.6.27 (NOT 26!) after resyncing the array under 2.6.24
Comment 24 Tejun Heo 2008-08-30 02:56:57 UTC
Thanks for testing.  Yeah, it looks like 2.6.27 is reading wrong data off the array, which confuses the filesystem and leads it to try to read sectors way beyond the actual fs size.  Can you please do the following test from 2.6.27?

* For each disk run "while true; do dd if=/dev/sdX bs=1M skip=128 count=512 | sha1sum; done" on separate terminals concurrently and see whether the values stay stable.
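One way to run that check on all the member disks at once instead of on separate terminals might be (a sketch; device letters a through g as used on this box):

  #!/bin/bash
  # hash the same 512 MiB window from every disk concurrently;
  # for a given disk the checksum should never change between passes
  for d in a b c d e f g; do
      ( while true; do
            echo "sd$d: $(dd if=/dev/sd$d bs=1M skip=128 count=512 2>/dev/null | sha1sum)"
        done ) &
  done
  wait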

(cc'ing Alan) Hello, Alan.  Beekhof is running a RAID5 array consisting of five Maxtor drives hanging off two pata_hpt37x controllers.  It works fine on 2.6.24, but both 2.6.26 and 27 seem to suffer from data corruption.  Does anything ring a bell?

Thanks.
Comment 25 F. P. Beekhof 2008-08-30 07:28:34 UTC
Note 1: there are 7 Maxtor drives in the array: 5 data, 1 parity & 1 spare. They are identical save one; that one is the same model of a later generation (added later as a spare). To my knowledge all disks in the array connect to an hpt374 on a PCI card.

Note 2: the 2.6.24 is Ubuntu's Hardy kernel, using the old IDE driver (/dev/hdX). The 2.6.26/27 kernels are from Intrepid, using libata.

So I did the test, and the values are not stable. Actually, while I was writing a script with the command in it, I tested on /dev/sdb and it seemed stable. When I started reading from the other disks in parallel, it broke; only /dev/sdb gave consistent hash values.

# cat tj.sh
#!/bin/bash
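# Read the same 512 MiB window from the given disk over and over and hash it;
# on a healthy controller the checksum should not change between passes.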

while true; do
        dd if=/dev/sd$1 bs=1M skip=128 count=512 | sha1sum
done
# ./tj.sh a
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 29.3459 s, 18.3 MB/s
d5e83831756dd3b95a9a9c20c1842e1e10c1e606  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 70.0029 s, 7.7 MB/s
d5e83831756dd3b95a9a9c20c1842e1e10c1e606  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 52.7025 s, 10.2 MB/s
e4132a1e7d879bffe3ac68e46a5ed721c27576d9  -

# ./tj.sh b
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 43.4472 s, 12.4 MB/s
d3ab9fe811354c5d2d3800d1808a0141b3a38409  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 72.8738 s, 7.4 MB/s
d3ab9fe811354c5d2d3800d1808a0141b3a38409  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 49.9387 s, 10.8 MB/s
d3ab9fe811354c5d2d3800d1808a0141b3a38409  -

# ./tj.sh c
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 60.7143 s, 8.8 MB/s
3d46ab57b5283289fc43e0ee916cee8b492d7797  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 62.0657 s, 8.7 MB/s
8dfe1a0c9242b4b02c20efa38994fa09a6823b58  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 46.7187 s, 11.5 MB/s
149447cc31cb20736651d3de8ca57ef05b76db04  -

# ./tj.sh d
512+0 records in
512+0 records out
f940399d486537d3a3c702e89647cc236dbf64e7  -
536870912 bytes (537 MB) copied, 70.2579 s, 7.6 MB/s
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 51.4253 s, 10.4 MB/s
44f9bd859c53277fa929d7ca6464b392e4e5d5ea  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 45.5928 s, 11.8 MB/s
18b5163f37336295781027a3e0f3887deb8849f4  -

# ./tj.sh e
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 75.7312 s, 7.1 MB/s
2e36294b7d5937d6908282b3bf237cfff4830d54  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 48.4509 s, 11.1 MB/s
2e36294b7d5937d6908282b3bf237cfff4830d54  -

# ./tj.sh f
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 72.6822 s, 7.4 MB/s
c66c60abbccac2ff45157fd03e739c31adbf423b  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 48.7152 s, 11.0 MB/s
c66c60abbccac2ff45157fd03e739c31adbf423b  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 49.6727 s, 10.8 MB/s
9f66a5d6bac79c855d47f84e65f9d92b0eb7ad07  -

# ./tj.sh g
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 50.5157 s, 10.6 MB/s
ac658d94574f4fcbc075253725960c8c03659abf  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 50.7572 s, 10.6 MB/s
69492386985221600892fecca038783b17dd516a  -
Comment 26 F. P. Beekhof 2008-08-30 07:29:38 UTC
Oh, /dev/sde was consistent too.
Comment 27 F. P. Beekhof 2008-09-05 02:24:57 UTC
I posted a request for help to:
https://bugs.launchpad.net/bugs/157909
This bug affected Ubuntu users with many PATA devices, i.e. people who are likely to be affected when they upgrade to Intrepid. Maybe something will turn up.
Comment 28 Tejun Heo 2008-09-08 02:12:50 UTC
It's most likely a pata_hpt37x-specific problem.  Can you attach the results of "lspci -nnvvvxxx" on the broken and working configurations?
Comment 29 Alan 2008-09-08 03:47:52 UTC
I've not seen a similar report, and as the data transfers are CRC protected it's a bit of an odd one, especially as it seems to be read side not write side.  The drive logging a media error is odd - that's either a drive with a real fault (which still shouldn't end up as corruption on the array), or a few drives seem to log a media error for requests to read an out-of-range sector.

I would like to know what happens with Tejun's stability test under the boot option libata.dma=0.


As to the corruption - I too suspect the data on disk is valid and is ending up wrong in memory, in which case, as you say, you want to keep the array mounted r/o for testing so nothing that shouldn't gets written back onto the disks.
Comment 30 F. P. Beekhof 2008-09-09 10:23:28 UTC
Created attachment 17696 [details]
Output of lspci on BROKEN
Comment 31 F. P. Beekhof 2008-09-09 10:31:33 UTC
Created attachment 17697 [details]
Output of lspci on WORKING
Comment 32 F. P. Beekhof 2008-09-09 10:40:10 UTC
Created attachment 17698 [details]
Output of lspci on BROKEN; this time _with_ root permissions

Please ignore the previous version; I forgot to sudo, so it's not complete.
Comment 33 F. P. Beekhof 2008-09-09 11:03:09 UTC
Interesting turn of events....

I got a "BIOS checksum error", so I entered the BIOS menu and restored order. I've switched on ACPI now, and enabled "v2"; maybe it was off before - I don't know. Anyway, now 2.6.24 wants to use libata too. But not with much success:
Linux DeathStar 2.6.24-19-generic #1 SMP Fri Jul 11 21:01:46 UTC 2008 x86_64 GNU/Linux

[   44.333016] Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
[   44.333022] ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
[   44.345024] HPT374: IDE controller (0x1103:0x0008 rev 0x07) at  PCI slot 0000:00:09.0
[   44.345043] ACPI: PCI Interrupt 0000:00:09.0[A] -> GSI 16 (level, low) -> IRQ 16
[   44.345051] PCI: Unable to reserve I/O region #5:100@5000 for device 0000:00:09.0
[   44.345057] HPT366_IDE: probe of 0000:00:09.0 failed with error -16
[   44.377974] ACPI: PCI Interrupt 0000:00:0a.0[A] -> GSI 17 (level, low) -> IRQ 17

Let me try what happens in 2.6.27 now.....
Comment 34 F. P. Beekhof 2008-09-09 11:12:35 UTC
Created attachment 17699 [details]
After this BIOS config change; 2.6.27 recognizes only 4 / 7 drives.

I'll fiddle with the bios some more. Maybe there's an update somewhere.
Comment 35 F. P. Beekhof 2008-09-14 05:56:43 UTC
Created attachment 17773 [details]
New dmesg

I've taken a screwdriver and pulled out the Initio card, and placed the hpt473 in a different PCI slot. At least now all drives are recognized again.

Then I booted into 2.6.24; the dmesg is attached. Then I ran the stability test; all is fine.

Will now reboot into 2.6.27 and see what happens.
Comment 36 F. P. Beekhof 2008-09-14 06:11:26 UTC
@ previous post: hpt374, not hpt473
Comment 37 F. P. Beekhof 2008-09-14 06:12:58 UTC
Created attachment 17774 [details]
new dmesg of 2.6.27

Now in 2.6.27.

All disks are recognized. Ran the stability test: NOT stable.
Is there anything else I can do?
Comment 38 F. P. Beekhof 2008-09-14 06:33:21 UTC
Created attachment 17775 [details]
dmesg of 2.6.27 with libata.dma=0

I think I saw something along the lines of "boot option libata.dma=0 not recognized" flash by on screen; anyway, here's the dmesg so you can check whether I did it right and whether it did what you'd expect. That 'libata.dma=0' is in the command line options, but the drives are still configured as UDMA/100.

Result of the stability test: NOT stable.
Comment 39 F. P. Beekhof 2008-09-14 09:27:44 UTC
Created attachment 17776 [details]
dmesg of 2.6.27 with PIO4 mode

OK, I managed to disable DMA under 2.6.27 and run in PIO4:
add "options libata dma=0" to /etc/modprobe.d/options,
then regenerate the initrd image and reboot.
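On Ubuntu that amounts to roughly the following (a sketch, run as root; /etc/modprobe.d/options and update-initramfs are what Hardy/Intrepid use):

  echo 'options libata dma=0' >> /etc/modprobe.d/options   # module option, not a kernel command line option
  update-initramfs -u                                      # rebuild the initrd so the option takes effect at boot
  reboot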

Stability test: NOT stable (but better than with DMA).
There's only one error; with DMA there are usually many more.
Maybe there is more than one bug.

---------- SDA (one error here) ------------
$ while [ 1 ]; do dd if=/dev/sda bs=1M skip=128 count=512 | sha1sum; done
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1414.08 s, 380 kB/s
d5e83831756dd3b95a9a9c20c1842e1e10c1e606  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1601.14 s, 335 kB/s
d5e83831756dd3b95a9a9c20c1842e1e10c1e606  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1585.15 s, 339 kB/s
30a441bdb552b5e2a2cb826201452e5ff03b45b3  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1590.44 s, 338 kB/s
d5e83831756dd3b95a9a9c20c1842e1e10c1e606  -
^C470+0 records in
469+0 records out
491782144 bytes (492 MB) copied, 1239.04 s, 397 kB/s

------ sdb ------------
$ while [ 1 ]; do dd if=/dev/sdb bs=1M skip=128 count=512 | sha1sum; done
d3ab9fe811354c5d2d3800d1808a0141b3a38409  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1525.09 s, 352 kB/s
d3ab9fe811354c5d2d3800d1808a0141b3a38409  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1609.51 s, 334 kB/s
d3ab9fe811354c5d2d3800d1808a0141b3a38409  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1633.86 s, 329 kB/s
d3ab9fe811354c5d2d3800d1808a0141b3a38409  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1550.52 s, 346 kB/s

----------- sdc -------------
$ while [ 1 ]; do dd if=/dev/sdc bs=1M skip=128 count=512 | sha1sum; done
3d46ab57b5283289fc43e0ee916cee8b492d7797  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1522.99 s, 353 kB/s
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1615.69 s, 332 kB/s
3d46ab57b5283289fc43e0ee916cee8b492d7797  -
3d46ab57b5283289fc43e0ee916cee8b492d7797  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1542.99 s, 348 kB/s
3d46ab57b5283289fc43e0ee916cee8b492d7797  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1564.58 s, 343 kB/s

---------- sdd -------------
$ while [ 1 ]; do dd if=/dev/sdd bs=1M skip=128 count=512 | sha1sum; done

f940399d486537d3a3c702e89647cc236dbf64e7  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1620.77 s, 331 kB/s
f940399d486537d3a3c702e89647cc236dbf64e7  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1587.29 s, 338 kB/s
f940399d486537d3a3c702e89647cc236dbf64e7  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1626.25 s, 330 kB/s
f940399d486537d3a3c702e89647cc236dbf64e7  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1616.49 s, 332 kB/s

-------sde -----------
 while [ 1 ]; do dd if=/dev/sde bs=1M skip=128 count=512 | sha1sum; done
2e36294b7d5937d6908282b3bf237cfff4830d54  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1596.04 s, 336 kB/s
2e36294b7d5937d6908282b3bf237cfff4830d54  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1562.05 s, 344 kB/s
2e36294b7d5937d6908282b3bf237cfff4830d54  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1606.15 s, 334 kB/s
2e36294b7d5937d6908282b3bf237cfff4830d54  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1622.05 s, 331 kB/s

-------------- sdf ------------
$ while [ 1 ]; do dd if=/dev/sdf bs=1M skip=128 count=512 | sha1sum; done
c66c60abbccac2ff45157fd03e739c31adbf423b  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1620.67 s, 331 kB/s
c66c60abbccac2ff45157fd03e739c31adbf423b  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1574.19 s, 341 kB/s
c66c60abbccac2ff45157fd03e739c31adbf423b  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1613.04 s, 333 kB/s
c66c60abbccac2ff45157fd03e739c31adbf423b  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1620.95 s, 331 kB/s

------------------ sdg--------------

$ while [ 1 ]; do dd if=/dev/sdg bs=1M skip=128 count=512 | sha1sum; done
69492386985221600892fecca038783b17dd516a  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1599.33 s, 336 kB/s
69492386985221600892fecca038783b17dd516a  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1610.98 s, 333 kB/s
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1609.5 s, 334 kB/s
69492386985221600892fecca038783b17dd516a  -
69492386985221600892fecca038783b17dd516a  -
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 1577.66 s, 340 kB/s
Comment 40 Tejun Heo 2008-09-28 20:45:56 UTC
Thanks for testing.  From the lspci output, the following differences stand out.

* For fn0, BusMaster is off for pata_hpt37x.  I have no idea why; this is turned on inside ata_pci_sff_init_one().

* Both fn0 and fn1 have a different latency configured.

* For fn0, reg 0x5b, which is the DPLL setting, has a different value: hpt366 has 0x17 while pata_hpt37x has 0x1f.  I don't know much about hpt controllers, but this seems to be the clock parameter for the ATA bus, so it could be the offending part.

* For fn1, the 0x4* regs, which determine the programmed transfer mode, look quite different between hpt366 and pata_hpt37x, although they end up using the same DMA mode.

Alan, does anything ring a bell?  Maybe we should try to make pata_hpt37x program the controller exactly the same as hpt366 does?

Thanks.
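For reference, those registers can be read from userspace with setpci; a sketch, assuming the controller sits at 00:09.0/00:09.1 as the dmesg in comment 33 indicates:

  setpci -s 00:09.0 COMMAND          # bit 2 is the Bus Master enable
  setpci -s 00:09.0 LATENCY_TIMER    # latency setting
  setpci -s 00:09.0 0x5b.B           # DPLL register that differs between hpt366 and pata_hpt37x
  setpci -s 00:09.1 0x40.L           # start of the transfer-mode registers on fn1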
Comment 41 F. P. Beekhof 2008-10-02 05:39:15 UTC
In http://www.kernel.org/pub/linux/kernel/v2.6/testing/ChangeLog-2.6.27-rc7
there is the following message; I don't know whether it refers to the libata driver or the old driver:

commit 62ff2ecf7a4e69f7271b7f7a57aaee76ffe610f2
Author: Masoud Sharbiani <masouds@google.com>
Date:   Wed Sep 10 22:22:34 2008 +0200

    ide: Fix pointer arithmetic in hpt3xx driver code (3rd try)
    
    git commit 74811f355f4f69a187fa74892dcf2a684b84ce99 causes crash at
    module load (or boot) time on my machine with a hpt374 controller.
    The reason for this is that for initializing second controller which sets
    (hwif->dev == host->dev[1]) to true (1), adds 1 to a void ptr, which
    advances it by one byte instead of advancing it by sizeof(hpt_info) bytes.
    Because of this, all initialization functions get corrupted data in info
    variable which causes a crash at boot time.
    
    This patch fixes that and makes my machine boot again.
    
    The card itself is a HPT374 raid conroller: Here is the lspci -v output:
    03:06.0 RAID bus controller: HighPoint Technologies, Inc. HPT374 (rev
    07)
            Subsystem: HighPoint Technologies, Inc. Unknown device 0001
            Flags: bus master, 66MHz, medium devsel, latency 120, IRQ 28
            I/O ports at 8000 [size=8]
            I/O ports at 7800 [size=4]
            I/O ports at 7400 [size=8]
            I/O ports at 7000 [size=4]
            I/O ports at 6800 [size=256]
            Expansion ROM at fe8e0000 [disabled] [size=128K]
            Capabilities: [60] Power Management version 2
    
    03:06.1 RAID bus controller: HighPoint Technologies, Inc. HPT374 (rev
    07)
            Subsystem: HighPoint Technologies, Inc. Unknown device 0001
            Flags: bus master, 66MHz, medium devsel, latency 120, IRQ 28
            I/O ports at 9800 [size=8]
            I/O ports at 9400 [size=4]
            I/O ports at 9000 [size=8]
            I/O ports at 8800 [size=4]
            I/O ports at 8400 [size=256]
            Capabilities: [60] Power Management version 2
    
    Signed-off-by: Masoud Sharbiani <masouds@google.com>
    Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    [bart: use dev_get_drvdata() per Sergei's suggestion]
    Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>

I think this is about the old IDE driver, but I'm not sure; mentioning it just in case.
Comment 42 Roy Abou Assaly 2009-01-13 07:57:00 UTC
Hi,

I have the exact same problem.  I upgraded from Ubuntu 8.04 (2.6.24) to Ubuntu 8.10 (2.6.26) and have been bleeding files left and right.  I only discovered this when I started to back up my 4-disk RAID5 array, which runs off a HighPoint 1640 card in JBOD mode with software RAID.  I've had the array running for about 2 years with no problems whatsoever.

All my drives are SATA.  And like the previous poster, everything happened exactly the same way.  I'm randomly bleeding files and fsck can no longer guarantee me any fixes.  I'm now mounting my array read-only and just copying whatever files it lets me copy.  I had not run fsck on it in a long time, so I only detected the problem by accident.  I'm certain there are many HPT users out there losing files without knowing it.  This is a very serious bug.  Please let me know if I can be of any help.

Linux ubuntuserver 2.6.27-9-generic #1 SMP Thu Nov 20 21:57:00 UTC 2008 i686 GNU/Linux

$ lspci -v
00:09.0 RAID bus controller: HighPoint Technologies, Inc. HPT374 (rev 07)
	Subsystem: HighPoint Technologies, Inc. Device 0001
	Flags: bus master, 66MHz, medium devsel, latency 120, IRQ 17
	I/O ports at a000 [size=8]
	I/O ports at a400 [size=4]
	I/O ports at a800 [size=8]
	I/O ports at ac00 [size=4]
	I/O ports at b000 [size=256]
	[virtual] Expansion ROM at 40100000 [disabled] [size=128K]
	Capabilities: [60] Power Management version 2
	Kernel driver in use: pata_hpt37x
	Kernel modules: pata_hpt37x

00:09.1 RAID bus controller: HighPoint Technologies, Inc. HPT374 (rev 07)
	Subsystem: HighPoint Technologies, Inc. Device 0001
	Flags: bus master, 66MHz, medium devsel, latency 120, IRQ 17
	I/O ports at b400 [size=8]
	I/O ports at b800 [size=4]
	I/O ports at bc00 [size=8]
	I/O ports at c000 [size=4]
	I/O ports at c400 [size=256]
	Capabilities: [60] Power Management version 2
	Kernel driver in use: pata_hpt37x
	Kernel modules: pata_hpt37x

Thanks.
Comment 43 Alan 2009-01-13 08:25:38 UTC
Waiting for HPT374 hardware from Andre.
Comment 44 Dennis Schafroth 2009-03-22 02:58:09 UTC
I am having the same problem on a RAID 5 on a Fedora kernel 2.6.27.19-170.2.35. 

I am also running a RAID1 on the same controller, and which of the two is affected seems to switch between boots.

I sometimes see an absurdly large hard disk size at boot:

[sdc] 72058694039790464 512-byte hardware sectors (36894051348373 MB)

but this did not happen on the last boot, and I still have the issue on the RAID5.

As soon as I have copied the data away, I will be willing to try things out.
Comment 45 Alan 2009-03-23 10:17:38 UTC
Dennis: please open a separate bug - and if it's the Fedora kernel, it is best to start with the Fedora bugzilla.

Thanks
Comment 46 F. P. Beekhof 2009-03-23 10:29:00 UTC
Has any hardware arrived? If not, I don't mind sending mine; it's
useless now anyway. If ever this bug is resolved I'd like to have it
back though.
Comment 47 Roy Abou Assaly 2009-03-23 20:42:13 UTC
As a workaround, I've been using the proprietary drivers listed under "Open Source driver v2.19" (http://www.highpoint-tech.com/USA/bios_rr1640.htm) since January 15 without any problems with the file system.

Please note that this driver is not completely open source, as it ships with a proprietary binary.
Comment 48 R. Daemen 2009-10-10 10:33:11 UTC
I'm bumping this bug because I just suffered the same bug in kernel 2.6.26 (Debian Lenny kernel). However, in my case the controller is *not* a HighPoint RocketRAID card but a simple Sil3124 controller. The problem is *exactly* as described here and caused my system to die. I recently reinstalled the system with Ubuntu Hardy (8.04 LTS) because it had the 2.6.24 kernel. With this kernel version my performance is back where it should be and I've seen no more data corruption on my filesystems (using both RAID1 and RAID5 with md).

I must make a remark about this though: the ext3 filesystem on the RAID1 ('root' filesystem) suffered the most severe corruption and I could only recover a very limited set of data; most files were just unreadable or merged with other files. The RAID5 set, using the JFS filesystem (mounted on /var/), wasn't affected as much. There was some data corruption on it, but not as severe as on "/". Note that mdadm reported all arrays as healthy the entire time, including during recovery of the data using a live disk.

The system config was a Via Epia LT15000 (CX700 chipset) with a Sil3124 PCI-SATA card and 4 Samsung SpinPoint HD160JJ disks. I cannot go back and try to recreate the errors (in order to get logs) because this system is in production.

I've stressed the RAID sets by writing over 400 GB of data to/from them and I have not suffered any corruption since I started using the 2.6.24 kernel. I will update my bug in the Debian bug tracker as well.
Comment 49 Tejun Heo 2009-10-12 07:31:54 UTC
Daemen, did the kernel whine about anything over those corruptions?
Comment 50 R. Daemen 2009-10-12 09:05:08 UTC
The moment I noticed the corruption, the console was overrun with I/O read errors. Unfortunately, no logs were preserved (some were, but those were unreadable, garbled text), so I cannot tell you the exact errors I saw. I did notice that when accessing files I sometimes had to try three or four times for the system to open a file (the reading application getting READ ERROR), and on some occasions the file was corrupted. Strangely enough, I don't remember seeing any write errors, just reads. This was before the system stopped responding.

The kernel whined about media errors when stressing the system, eventually resetting all the SATA devices at the same time. During the recovery (using the older 2.6.24 kernel) none of these errors were logged, nor did I get any more read errors. The system died because it started whining about corrupted files on the '/' filesystem, and on reboot it could no longer locate the kernel image. During recovery I checked whether I could still find it, but the kernel image had 'disappeared' completely from the fs, probably due to the fsck that had to be done in order to mount the fs (the system did not shut down cleanly; it froze and stopped responding to anything, didn't even kernel-panic, and I hard-reset the machine after waiting for about 5 minutes). All the data I could recover from the '/' fs had been moved to lost+found after the fsck.

I cannot test this because I needed the machine back online a.s.a.p.; I just wanted to make a note of the similarities in the problems for a different controller.

Regards,
Rudi
Comment 51 Tejun Heo 2009-10-12 09:19:00 UTC
Without logs I can't tell much, but all disks being reset at the same time strongly suggests a power problem.  2.6.27 implements proper barrier support for RAID arrays, which means that when the filesystem issues a barrier, all member disks of an array will flush at the same time, spiking the power usage.  If the PSU is inadequate, this makes the voltage drop briefly; the drives spin down and lose whatever was in their buffers.  When the voltage recovers, the drives spin back up and re-establish communication with the host.  The host notices the links went offline and then back online, but to the host the event looks like a transient transmission problem, so the driver resets the drives, reconfigures them, and goes on.  As data is lost in the process while the system carries on as if nothing happened, the filesystem on the array gets seriously screwed up.

I'm fairly certain this is what you saw, and you would be the fourth person reporting exactly the same problem.  Kernel logs and smartctl output from those failures can confirm it (the kernel reports PHY events and smartctl will log those brief spindown events).  So it's very likely that your machine has an inadequate power supply.  If pre-barrier kernels had no problem for a long time, turning off barrier support might be enough, but if I were you I wouldn't trust any valuable data on that array regardless of kernel version.
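For ext3, barrier behaviour can be toggled per mount, so one way to test that theory might be (a sketch; the mount point /mnt/array is assumed, not taken from this report):

  mount -o remount,barrier=0 /mnt/array    # disable barriers (roughly the pre-2.6.27 behaviour on md)
  # ... re-run the workload that triggered the simultaneous resets ...
  mount -o remount,barrier=1 /mnt/array    # re-enable barriers afterwards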
Comment 52 R. Daemen 2009-10-12 09:42:09 UTC
I am having a hard time believing the power supply is the problem, as it is more than sufficient for the system. The disks do not spin down at any point. Please note: the system is an embedded Via board with a peak consumption of 22 watts; all I added to it was the controller (no power info, I used 20 watts as a reference), 4 disks (20 watts peak during spinup) and a 250-watt PSU. I checked the power requirements of all the system components and found that peak power consumption would be 120 watts during booting due to all disks spinning up. After that, consumption drops to about 40-50 watts depending on system/disk load. The voltages given by this power supply are well within the nominal range, with less than 5% deviation.

But to be sure whether or not this is happening, I'll try to get a multimeter and monitor the voltages with an external instrument rather than just going by the values reported by the system itself.
Comment 53 Tejun Heo 2009-10-12 09:50:58 UTC
I don't think you'll be able to see the voltage drop with a multimeter; the drop would be very short.  Reproducing the problem and comparing smartctl output before and after is a sure way to tell.  Plus, the problem doesn't seem to have much to do with the rated wattage of the power supply.  My electrical knowledge is pretty limited, but my hunch is that adding a large capacitor across the 12 V output lines would probably solve the problem.
Comment 54 R. Daemen 2009-10-12 10:41:14 UTC
Which value specifically should I be looking for in smartctl? I looked at the output it gave me before the system died but never saved a log of it, as it did not show me anything out of the ordinary. The current smartctl output gives no weird values at all; I included one of the 4 disks below. All 4 disks report the same values (excluding the disk I used for failover testing, which differs only marginally). The only thing I noticed, and I don't know whether it indicates a problem, is the Hardware_ECC_Recovered counter, which is over 16 million; SMART is not tripped by it either way.

On a side note: adding a capacitor to my power supply sounds to me like voiding its warranty, as it's just over 1 year old ;) Also: the disks are in a disk bay that uses two power connectors. I connected them to different 12 V power rails, so the peak per 12 V rail would be at most 2 disks rather than 4.

output of smartctl for /dev/sda:
smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint P80 SD series
Device Model:     SAMSUNG HD160JJ
Serial Number:    S08HJ10L443925
Firmware Version: ZM100-41
User Capacity:    160,041,885,696 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 4a
Local Time is:    Mon Oct 12 12:02:04 2009 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                 (3677) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  61) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0007   100   100   025    Pre-fail  Always       -       6016
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       25
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       27642
 10 Spin_Retry_Count        0x0033   253   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   253   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       23
187 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0
190 Temperature_Celsius     0x0022   124   118   000    Old_age   Always       -       38
194 Temperature_Celsius     0x0022   124   118   000    Old_age   Always       -       38
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       16614423
196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   253   253   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
202 TA_Increase_Count       0x0032   253   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     27469         -

SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Comment 55 Tejun Heo 2009-10-12 10:46:26 UTC
Oh, please don't put a capacitor on the output lines on my advice.  It was just a random hunch.  :-P

Anyway, the counters that are usually incremented by such events are some combination of Start_Stop_Count, Power_Cycle_Count and Power-Off_Retract_Count; those increment after each such event.  Another thing you can try is getting a separate ATX power supply (any cheap one), powering it up on its own, and connecting half of the drives to it.  You can power up an ATX PSU without a motherboard by following the instructions here:

  http://modtown.co.uk/mt/article2.php?id=psumod
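A minimal sketch of how those counters could be snapshotted before and after a suspected power event, assuming smartmontools is installed and the drives show up as /dev/sda, /dev/sdb, ... (adjust the glob to your setup):

$ # attribute IDs 4, 12 and 192 on most drives; vendor-specific names can differ
$ for d in /dev/sd?; do echo "== $d =="; sudo smartctl -A "$d" | egrep 'Start_Stop_Count|Power_Cycle_Count|Power-Off_Retract_Count'; done

If the numbers jump after an incident without a deliberate reboot, that would point at a power problem rather than the controller driver.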
Comment 56 R. Daemen 2009-10-12 12:48:18 UTC
Seeing the very low numbers in the fields you mentioned for a drive that's over 3 years old, I'd say it might not be the power issue :) Besides, the data corruption problem didn't occur until I swapped the buggy Promise card for a well-supported Sil3124 card, and after the swap it took less than 4 days to crash the system. (Note: this system had been operating for over a year on the Promise card before I replaced it.)

Anyway, there's no need for me to mod an ATX PSU; I still have an old 200 watt AT PSU which I use for testing disk/CD drives. If the PSU were out of warranty, I'd probably do the capacitor mod ;-) just for the sake of it :-P but I think this is getting somewhat off-topic :-P

Either way, thanks for the responses. I'll be sure to check this possible power issue again if I notice any disk corruption or resets of all the SATA drives, and I'll make sure to capture log entries on a USB storage device so they survive a system failure. However, seeing how the system is running now, I don't expect to see the bugs anymore. I've stressed the system with massive I/O and not a single reset, corruption or performance drop has occurred so far. Whenever my server goes offline for any reason, I'll try a Live CD with one of the newer kernel revisions and see if I can reproduce the problem there.
Comment 57 jb 2010-04-29 11:01:56 UTC
With the CX700 chipset, if the modules ehci-hcd, uhci-hcd, snd-hda-intel, via-agp, 8139too and ath5k are loaded, the disks crash. Sometimes it looks like the USB connection to the external HDD failed (in the middle of copying data). Sometimes the internal ATA disk stops working (ID 5372).

jb
Comment 58 Ralf-Peter Rohbeck 2011-01-31 06:05:57 UTC
I encountered spurious silent data corruption with pata_hpt37x in Ubuntu Maverick (2.6.35-24-generic and 2.6.35-25-generic kernels). Building and installing HighPoint's hpt374.ko 'Open Source' driver fixed it.

The configuration is 5 mixed drives in an md RAID5, 4 on a HighPoint 1640 and one on a motherboard SATA port. The system is a Compaq Presario (2.6 GHz P4, 2.5 GB memory) which uses an ASUS OEM motherboard. There's also an Adaptec 29320 SCSI HBA with SDLT drives attached. See the attached files for details.

There was nothing in the logs. The data corruption occurred while reading and usually appeared as 4 random bytes at the end of a 64K block, but sometimes there were longer stretches of corrupt data (they could have been caused by bad metadata, of course). This happened under high disk load - I never noticed it under normal desktop use, only when I restored a tape (several hundred GB of files) or ran lengthy file comparisons with cmp.

A comparison of the original vs. the corrupt file looks like:
$ diff g1 b1
4608c4608
< 1e18fff0:b9c7 43a2 868c 6e40 01b7 1956 17a3 dbf6  ..C...n@...V....
---
> 1e18fff0:b9c7 43a2 868c 6e40 01b7 1956 4827 412b  ..C...n@...VH'A+
Dropping caches and trying again cleared the corruption or caused it to show up somewhere else.
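A minimal sketch of that kind of re-check (the file names original and copy are hypothetical; the known-good copy is assumed to live on an unaffected disk):

$ sync; echo 3 | sudo tee /proc/sys/vm/drop_caches   # drop the page cache so the next read comes from the array, not RAM
$ cmp original copy || echo "still (or differently) corrupt on this pass"

Repeating this a few times shows whether the mismatch moves around, which distinguishes a read-path problem from on-disk corruption.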

See http://ubuntuforums.org/showpost.php?p=10413437&postcount=20 on the http://ubuntuforums.org/showthread.php?p=10413437#post10413437 thread.
I'll attach some files. Mail me if I should run any tests - this is easily reproducible with the original driver.

Ralf-Peter
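Regarding the driver swap above, a minimal sketch of how the in-kernel pata_hpt37x driver can be kept from binding so a separately built hpt374.ko gets used instead (Ubuntu/Debian tooling assumed; the config file name is arbitrary and the vendor module must already be installed):

$ echo "blacklist pata_hpt37x" | sudo tee /etc/modprobe.d/blacklist-pata-hpt37x.conf
$ sudo update-initramfs -u    # rebuild the initramfs so the blacklist also applies at early boot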
Comment 59 Ralf-Peter Rohbeck 2011-01-31 06:09:02 UTC
Created attachment 45572 [details]
lspci
Comment 60 Ralf-Peter Rohbeck 2011-01-31 06:15:59 UTC
Created attachment 45582 [details]
dmesg

Nothing in there except for the details of my drives etc.
/home is ext4 on /dev/mapper/big-home. LVM is on md2, a 5-drive RAID5. That's where the corruption occurred.
Comment 61 Ralf-Peter Rohbeck 2011-01-31 06:17:05 UTC
Created attachment 45592 [details]
dmidecode
Comment 62 Ralf-Peter Rohbeck 2011-01-31 06:18:24 UTC
Created attachment 45602 [details]
lvm info
Comment 63 Ralf-Peter Rohbeck 2011-01-31 06:18:55 UTC
Created attachment 45612 [details]
fstab
Comment 64 Tejun Heo 2011-01-31 10:02:16 UTC
hpt374 definitely seems to have a data corruption issue but I have no idea what it's caused by or how to fix it.  Alan, any ideas?
Comment 65 R. Daemen 2011-03-15 11:50:01 UTC
Sorry for the bump, but here's something to keep in mind with this issue. Some reading up on the internet has put me on a different track regarding the loss of the RAID arrays in my old setup:

It seems that starting with kernels newer than 2.6.24 there might be a change in the timers mdadm uses to declare a disk dead or alive. This is possibly related to ERC/CCTL/TLER settings: the time a drive is allowed to spend recovering from a read error (e.g. a bad sector). Normal consumer disks have ERC/CCTL/TLER disabled by default (or don't support it at all), which can cause them to take up to a minute to respond to the OS again.

With ERC/CCTL/TLER enabled, a disk will respond to the OS/RAID driver again after a specified amount of time (e.g. 7 seconds). The RAID driver then sees that the drive is still alive and corrects the data using the data on the other drives.

Could it be that mdadm's disk timers have changed, causing newer kernels to drop groups of disks from the array?
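For reference, on drives that support it the error-recovery timeout can be inspected and set via smartctl's SCT ERC support (a sketch; requires a reasonably recent smartmontools, values are in units of 100 ms, and many consumer drives simply reject the command):

$ sudo smartctl -l scterc /dev/sda          # show the current read/write recovery timeouts
$ sudo smartctl -l scterc,70,70 /dev/sda    # set both to 7.0 seconds (70 x 100 ms)

The setting is typically lost on power cycle, so it would have to be reapplied at boot for every member drive of the array.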
Comment 66 Tejun Heo 2011-03-15 15:25:12 UTC
R. Daemen, can you please open a separate bug report?  That's a different issue, right?
Comment 67 Ralf-Peter Rohbeck 2011-03-18 04:58:04 UTC
I installed Debian squeeze (which uses the hpt366 driver) and the problem is gone.
Comment 68 Alan 2012-10-30 15:01:55 UTC
If this is still seen on modern kernels then please re-open/update
