Bug 10480 - sil3114 yields "ext3_new_block: Allocating block in system zone"
Summary: sil3114 yields "ext3_new_block: Allocating block in system zone"
Status: CLOSED WILL_NOT_FIX
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: All Linux
Importance: P1 high
Assignee: Tejun Heo
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-04-18 23:48 UTC by Ha Quoc Viet
Modified: 2012-05-21 15:05 UTC
CC List: 12 users

See Also:
Kernel Version: 2.6.23 - 2.6.25
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg of last working kernel (2.6.21.7) (19.41 KB, text/plain)
2008-04-23 22:33 UTC, Ha Quoc Viet
Details
dmesg of first buggy kernel (2.6.22) (19.70 KB, text/plain)
2008-04-23 22:33 UTC, Ha Quoc Viet
Details
This little c program can be used to generate a test file (170 bytes, text/x-csrc)
2008-04-25 01:26 UTC, Tejun Heo
Details
Output of lspci -nnvvvxxx via CentOS 4.7 installer linux rescue mode (15.81 KB, text/plain)
2009-01-18 23:59 UTC, Aaron Greenspan
Details
Output of lspci (2.13 KB, text/rtf)
2009-01-26 19:45 UTC, Brett Park
Details
System dmesg (39.41 KB, application/octet-stream)
2009-02-03 03:33 UTC, Juris Krumins
Details
Error of my system (124.63 KB, application/octet-stream)
2009-02-03 03:34 UTC, Juris Krumins
Details
lspci info of my system (83.56 KB, application/octet-stream)
2009-02-03 03:34 UTC, Juris Krumins
Details
partitioning information (741 bytes, application/octet-stream)
2009-02-03 03:35 UTC, Juris Krumins
Details
Examples of exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen (328.11 KB, text/plain)
2009-02-03 18:10 UTC, Brett Park
Details
Syslog containing EXT3-fs error (137.41 KB, text/plain)
2009-02-03 18:12 UTC, Brett Park
Details
lspci -nnvvvxxx (21.75 KB, text/plain)
2009-04-14 04:34 UTC, Alex Vangelion
Details
lspci -nnvvvxxx (13.24 KB, text/plain)
2009-04-14 18:45 UTC, grabben.kernel
Details
"lspci -nnvvvxxx" output after data write failure (21.75 KB, text/plain)
2009-04-24 22:51 UTC, Alex Vangelion
Details
sil-dbg.patch (525 bytes, patch)
2009-04-26 01:56 UTC, Tejun Heo
Details | Diff
SiI 3132 corruption on XFS (6.91 KB, text/plain)
2009-06-04 21:09 UTC, Richard Huddleston
Details
lspci info and /proc/interrupts on 440BX, working perfectly (15.26 KB, text/plain)
2009-09-02 14:37 UTC, Pascal Vandeputte
Details
lspci info and /proc/interrupts on KT133A, broken (18.54 KB, text/plain)
2009-09-02 14:38 UTC, Pascal Vandeputte
Details

Description Ha Quoc Viet 2008-04-18 23:48:48 UTC
Latest working kernel version: 2.6.21
Earliest failing kernel version: 2.6.23 AFAIK, maybe 2.6.22 too
Distribution: Debian testing i386, updated as of April 14th, 2008
Hardware Environment: Celeron 500/66FSB, 600MB RAM, Transcend AVD1 with VT82C692BX
Software Environment: Debian testing, no X
Problem Description: data corruption on SATA Silicon Image 3114

I have had this issue on:
ext3 over raid5 (md) (9 x Seagate 250GB SATA)
ext3 over raid0 (md) (9 x Seagate 250GB SATA)
ext3 over ----- (no md) (Seagate 320GB SATA)
Someone on the Ubuntu forums has reported the same issue with XFS over raid5 (md).
I am currently adding XFS to my kernel to test it.

Also on another computer running Debian testing amd64 (two Opteron 285s).

Steps to reproduce:
copy data on a sata drive, connected to a sata sil3114

The logs are a bit different on 2.6.25 compared to previous kernels.
On 2.6.25, the logs spew:
Apr 19 06:01:32 Backup kernel: EXT3-fs error (device md0): ext3_valid_block_bitmap: Invalid block bitmap - block_group = 5709, block = 187072514
Apr 19 06:01:32 Backup kernel: EXT3-fs error (device md0): ext3_new_block: Allocating block in system zone - blocks from 187072517, length 1
Apr 19 06:01:52 Backup kernel: EXT3-fs error (device md0): ext3_valid_block_bitmap: Invalid block bitmap - block_group = 5711, block = 187138050
Apr 19 06:02:40 Backup kernel: EXT3-fs error (device md0): ext3_valid_block_bitmap: Invalid block bitmap - block_group = 7464, block = 244580354

On 2.6.24.3 and 2.6.23:
Feb 24 06:38:59 Backup kernel: EXT3-fs error (device md0): ext3_new_block: Allocating block in system zone - blocks from 24870917, length 1
Feb 24 06:38:59 Backup kernel: EXT3-fs error (device md0): ext3_new_block: Allocating block in system zone - blocks from 24870918, length 1
Feb 24 06:38:59 Backup kernel: EXT3-fs error (device md0): ext3_new_block: Allocating block in system zone - blocks from 24870919, length 1
Comment 1 Ha Quoc Viet 2008-04-19 01:23:06 UTC
OK, so I have just tested XFS on raid0 (md) with 9 x Seagate 250GB SATA.

I am getting the following logs on 2.6.25 countless times, as soon as the transfer starts:
Apr 19 10:18:19 Backup kernel: Filesystem "md0": XFS internal error xfs_btree_check_sblock at line 334 of file fs/xfs/xfs_btree.c.  Caller 0xe8bbbe38
Apr 19 10:18:19 Backup kernel: Pid: 124, cock+0xa0/0xaf [xfs]
Apr 19 10:18:19 Backup kernel:  [<e8bbbe38>] xfs_alloc_lookup+0x131/0x34a [xfs]
Apr 19 10:18:19 Backup kernel:  [<e8bbbe38>] xfs_alloc_lookup+0x131/0x34a [xfs]
Apr 19 10:18:19 Backup kernel:  [<e8c009c7>] kmem_zone_zalloc+0x1c/0x3d [xfs]
Apr 19 10:18:19 Backup kernel:  [<e8bba753>] xfs_alloc_ag_vextent_near+0x46/0x8b6 [xfs]
Apr 19 10:18:19 Backup kernel:  [<c01212c0>] run_timer_softirq+0x199/0x1b3
Apr 19 10:18:19 Backup kernel:  [<c011e1a4>] __do_softirq+0x59/0x85
Apr 19 10:18:19 Backup kernel:  [<e8bbafed>] xfs_alloc_ag_vextent+0x2a/0xbf [xfs]
Apr 19 10:18:19 Backup kernel:  [<e8bbb7e1>] xfs_alloc_vextent+0x2e6/0x43d [xfs]
Apr 19 10:18:19 Backup kernel:  [<e8bcbafa>] xfs_bmap_btalloc+0x76e/0x977 [xfs]
Apr 19 10:18:19 Backup kernel:  [<c0114587>] update_curr+0x3d/0x52
Apr 19 10:18:19 Backup kernel:  [<e8be47ad>] xfs_iext_bno_to_ext+0xd8/0x191 [xfs]
Apr 19 10:18:19 Backup kernel:  [<e8bcc4c6>] xfs_bmapi+0x7be/0x1185 [xfs]
Apr 19 10:18:19 Backup kernel:  [<e8becb07>] xlog_grant_push_ail+0xe0/0x120 [xfs]
Apr 19 10:18:19 Backup kernel:  [<e8be9848>] xfs_iomap_write_allocate+0x243/0x356 [xfs]
Apr 19 10:18:19 Backup kernel:  [<e8bea535>] xfs_iomap+0x2c8/0x32f [xfs]
Apr 19 10:18:19 Backup kernel:  [<e8c00e06>] xfs_map_blocks+0x2a/0x77 [xfs]
Apr 19 10:18:19 Backup kernel:  [<e8c02059>] xfs_page_state_convert+0x368/0x65d [xfs]
Apr 19 10:18:19 Backup kernel:  [<e8c0246d>] xfs_vm_writepage+0x8a/0xbd [xfs]
Apr 19 10:18:19 Backup kernel:  [<c0148584>] __writepage+0x8/0x1f
Apr 19 10:18:19 Backup kernel:  [<c01489e6>] write_cache_pages+0x153/0x260
Apr 19 10:18:19 Backup kernel:  [<c014857c>] __writepage+0x0/0x1f
Apr 19 10:18:19 Backup kernel:  [<e8c00faa>] xfs_vm_writepages+0x0/0x51 [xfs]
Apr 19 10:18:19 Backup kernel:  [<c0148b0d>] generic_writepages+0x1a/0x21
Apr 19 10:18:19 Backup kernel:  [<c0148b34>] do_writepages+0x20/0x30
Apr 19 10:18:19 Backup kernel:  [<c017655f>] __writeback_single_inode+0x157/0x279
Apr 19 10:18:19 Backup kernel:  [<c012007b>] warn_legacy_capability_use+0x18/0x34
Apr 19 10:18:19 Backup kernel:  [<c0176974>] sync_sb_inodes+0x179/0x23d
Apr 19 10:18:19 Backup kernel:  [<c0176d59>] writeback_inodes+0x6f/0xea
Apr 19 10:18:19 Backup kernel:  [<c0149575>] pdflush+0x0/0x1c9
Apr 19 10:18:19 Backup kernel:  [<c01491e2>] wb_kupdate+0x6f/0xd1
Apr 19 10:18:19 Backup kernel:  [<c0149575>] pdflush+0x0/0x1c9
Apr 19 10:18:19 Backup kernel:  [<c0149694>] pdflush+0x11f/0x1c9
Apr 19 10:18:19 Backup kernel:  [<c0149173>] wb_kupdate+0x0/0xd1
Apr 19 10:18:19 Backup kernel:  [<c012998d>] kthread+0x36/0x5d
Apr 19 10:18:19 Backup kernel:  [<c0129957>] kthread+0x0/0x5d
Apr 19 10:18:19 Backup kernel:  [<c010530f>] kernel_thread_helper+0x7/0x10
Comment 2 Ha Quoc Viet 2008-04-19 03:44:55 UTC
2.6.22 mainline fails too (i586, ext3 over software raid0)

Apr 19 12:28:50 Backup kernel: EXT3-fs error (device md0): ext3_new_block: Allocating block in system zone - blocks from 183304197, length 1
Apr 19 12:30:48 Backup kernel: EXT3-fs error (device md0): ext3_new_block: Allocating block in system zone - blocks from 183959557, length 1
Apr 19 12:33:33 Backup kernel: EXT3-fs error (device md0): ext3_new_block: Allocating block in system zone - blocks from 184778757, length 1
Apr 19 12:37:15 Backup kernel: EXT3-fs error (device md0): ext3_new_block: Allocating block in system zone - blocks from 184123397, length 1
Apr 19 12:40:40 Backup kernel: EXT3-fs error (device md0): ext3_new_block: Allocating block in system zone - blocks from 185958404, length 1
Comment 3 Adrian Bunk 2008-04-19 10:33:03 UTC
Please make an attachment with the output of "dmesg -s 1000000" directly after booting the last working kernel, and another attachment with "dmesg -s 1000000" directly after booting the first broken kernel.
Comment 4 Ha Quoc Viet 2008-04-23 22:33:02 UTC
Created attachment 15880 [details]
dmesg of last working kernel (2.6.21.7)
Comment 5 Ha Quoc Viet 2008-04-23 22:33:46 UTC
Created attachment 15881 [details]
dmesg of first buggy kernel (2.6.22)
Comment 6 Ha Quoc Viet 2008-04-23 22:36:16 UTC
I confirm that md has nothing to do with the bug (tested with ext3 on the device itself, rather than ext3 over raid-something over the device)
Comment 7 Tejun Heo 2008-04-23 22:43:44 UTC
Jan, can you please enlighten me on what those ext3 error messages mean?
Comment 8 Jan Kara 2008-04-24 00:09:44 UTC
They generally mean the filesystem is corrupted. In this particular case, the bitmap of used blocks is corrupted: system blocks (most likely the inode table) are marked as free in the bitmap, and we then tried to allocate from there...
Comment 9 Ha Quoc Viet 2008-04-24 00:17:32 UTC
Is it that libata evolved from 2.20 to 2.21 between the two kernels?

Anyway, both ext3 and XFS are failing, so it would seem that the issue is in the driver

(and architecture-independent, since I'm seeing this on amd64 and i586).
Comment 10 Tejun Heo 2008-04-24 00:44:20 UTC
Can you please post the result of "lspci -nn"?  We had several reports of data corruption on sata_sil on nvidia chipsets.
Comment 11 Ha Quoc Viet 2008-04-24 01:29:52 UTC
cat lspci-nn.txt
00:00.0 Host bridge [0600]: VIA Technologies, Inc. VT82C693A/694x [Apollo PRO133x] [1106:0691] (rev 44)
00:01.0 PCI bridge [0604]: VIA Technologies, Inc. VT82C598/694x [Apollo MVP3/Pro133x AGP] [1106:8598]
00:07.0 ISA bridge [0601]: VIA Technologies, Inc. VT82C596 ISA [Mobile South] [1106:0596] (rev 23)
00:07.1 IDE interface [0101]: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE [1106:0571] (rev 10)
00:07.2 USB Controller [0c03]: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller [1106:3038] (rev 11)
00:07.3 Host bridge [0600]: VIA Technologies, Inc. VT82C596 Power Management [1106:3050] (rev 30)
00:10.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ [10ec:8139] (rev 10)
00:11.0 Mass storage controller [0180]: Promise Technology, Inc. PDC20575 (SATAII150 TX2plus) [105a:3d75] (rev 02)
00:12.0 RAID bus controller [0104]: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller [1095:3114] (rev 02)
00:14.0 Mass storage controller [0180]: Promise Technology, Inc. PDC20575 (SATAII150 TX2plus) [105a:3d75] (rev 02)
Comment 12 Ha Quoc Viet 2008-04-24 23:28:02 UTC
It seems that the bug already existed in 2.6.21.7:
I have just had an ext2 error (two, actually) on ext2 over a raid0 array (9 x 250GB). With ext3 I would normally be swamped by errors as soon as the transfer starts (does the filesystem make a difference?). Here, I had been creating files for 36 hours on ext2 before getting only two errors:

Backup:~# egrep "EXT2-fs error" /var/log/syslog.0
Apr 25 02:07:04 Backup kernel: EXT2-fs error (device md0): ext2_new_block: Allocating block in system zone - block = 148570116
Apr 25 06:44:23 Backup kernel: EXT2-fs error (device md0): ext2_new_block: Allocating block in system zone - block = 536248324
Backup:~#

I'll test 2.6.21.7 again with ext3 over the long run.
Comment 13 Tejun Heo 2008-04-25 01:25:01 UTC
Can you please create a simpler test case?  E.g. something like the following.

# CKSUM=$(dd if=somebigfile bs=1M count=1M | md5sum)
# while true; do dd if=somebigfile of=/dev/sda bs=1M count=1M; CURSUM=$(dd if=/dev/sda bs=1M count=1M | md5sum); if [ "$CKSUM" != "$CURSUM" ]; then break; fi; done
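
A slightly more careful variant, as a sketch only: it assumes a dedicated scratch device (called /dev/sdX here, whose contents are destroyed) and a test file smaller than the device, sizes the read-back to the file length so the checksums stay comparable, and drops the page cache so the read-back actually comes from the disk.

#!/bin/sh
# Sketch: repeatedly write a known file to a scratch device and read it back,
# stopping as soon as the read-back checksum no longer matches the original.
# WARNING: destroys everything on $DEV.  DEV and FILE are placeholders.
DEV=/dev/sdX
FILE=testfile
BYTES=$(stat -c %s "$FILE")
ORIG=$(md5sum < "$FILE" | cut -d' ' -f1)
while true; do
        dd if="$FILE" of="$DEV" bs=1M conv=fsync 2>/dev/null
        echo 3 > /proc/sys/vm/drop_caches    # make the read-back hit the disk, not the page cache
        CUR=$(head -c "$BYTES" "$DEV" | md5sum | cut -d' ' -f1)
        if [ "$ORIG" != "$CUR" ]; then
                echo "mismatch: wrote $ORIG, read back $CUR"
                break
        fi
done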
Comment 14 Tejun Heo 2008-04-25 01:26:25 UTC
Created attachment 15905 [details]
This little c program can be used to generate a test file
Comment 15 nate 2009-01-05 11:10:10 UTC
Hi.  I am posting because I have fought the exact same symptom for the past 4 days and have resolved the issue.  After reading this post, I stepped my kernel from 2.6.27 all the way back to 2.6.15, trying each minor release.  Reverting the kernel did not resolve the issue.  I can confirm that this symptom with the sil3114 results in data corruption regardless of the underlying filesystem (e.g. ext2, ext3, reiserfs, etc.).

The issue was that my sil3114 card had, unknown to me, an incomplete/corrupt BIOS flash.  The card would boot, POST, find drives, and otherwise appear to function completely in Linux.  However, writing a modest amount of data to a formatted volume would repeatedly cause data corruption.  Re-flashing the BIOS and confirming that the flash was successful has resulted in a perfectly functioning sil3114 card.

Download the sil3114 5.4.0.3 BIOS from Silicon Image, plus their latest flash utility (I used the DOS version).  Use an old Windows 98 install CD to boot DOS for the flash, as this is the most convenient way to run the utility with CD-ROM support.  Run updflash first, with no parameters, and erase the BIOS once or twice.  Quit the utility.  Then re-run it as 'updflash R5403.BIN'; this will automatically find and flash the card BIOS.

Ensure that after the flash you see 'Flash Successful' and 'Return 0' (or something similar).  If you see 'Flash Failed, retry? Y/N', the BIOS is bad regardless of how the card appears to function in the system.

Hope this helps.  I know it could have saved me about 3-4 days of troubleshooting!
Comment 16 Tejun Heo 2009-01-06 18:50:46 UTC
Ah... it would have been interesting to have the before and after outputs of "lspci -nnvvvxxx", so we could see which PCI config registers the previous BIOS screwed up.  Thanks for the workaround.
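
For anyone who hits this later, a minimal way to capture that information around a reflash could be the following (sketch only; 00:12.0 is the card's slot from comment #11 and will differ per machine, and the file names are arbitrary):

lspci -nnvvvxxx -s 00:12.0 > lspci-before.txt
# ... reflash the card and reboot ...
lspci -nnvvvxxx -s 00:12.0 > lspci-after.txt
diff -u lspci-before.txt lspci-after.txt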
Comment 17 Aaron Greenspan 2009-01-18 23:04:47 UTC
I've been trying to install CentOS 4.4, 4.7 and 5.2 for the past few days on a Dell PowerEdge 1650 server with a Syba SiI 3114-based 4-port RAID controller, and as soon as anaconda tries to start copying files to the hard drive(s) after formatting, whether or not software or hardware RAID is in use, the following errors (and many more like them) appear:

<2>EXT3-fs error (device sda3): ext3_new_block: Allocating block in system zone - block = 50102272
<3>Aborting journal on device sda3.
<2>EXT3-fs error (device sda3) in ext3_free_blocks_sb: Journal has aborted
<2>EXT3-fs error (device sda3): ext3_free_blocks: Freeing blocks in system zone - Block = 50102272, count = 1
<2>EXT3-fs error (device sda3) in ext3_free_blocks_sb: Journal has aborted
<2>EXT3-fs error (device sda3) in ext3_prepare_write: Journal has aborted
<2>ext3_abort called. 

I had tried to flash the 3114RAID BIOS as Nate suggested in the PowerEdge itself, but the system kept complaining of an "NMI System Parity Error" at module F000:C08C whenever I tried to run UPDFLASH, and it would not give the Return 0 successful exit code. I took the PCI card out of the PowerEdge and put it into a generic desktop system, where I was able to flash the BIOS using the above procedure successfully. When I put the card back into the Dell system, I tried to install CentOS once more, and ran into the exact same problem. So, basically Nate's procedure did not work for me, even though it's probably good in general to have a properly-flashed BIOS.

I'm kind of curious as to why the Dell NMI subsystem has a problem with the card. I tried taking out individual RAM DIMMs and it made no difference. According to http://support.dell.com/support/edocs/systems/pe1650/en/sm/beep.htm, the error message "System parity error" can mean a "defective" expansion card. I don't think the card is defective since two identical cards in two identical servers are yielding the same result, but clearly something is wrong...
Comment 18 Aaron Greenspan 2009-01-18 23:59:09 UTC
Created attachment 19884 [details]
Output of lspci -nnvvvxxx via CentOS 4.7 installer linux rescue mode
Comment 19 Aaron Greenspan 2009-01-20 01:34:43 UTC
I believe I've figured out why the NMI subsystem was complaining about the PCI card and yielding the "System parity error": according to the card's manufacturer (Syba), even though this isn't documented anywhere, the card only works in 32-bit 5V PCI slots.  My machine only has two 64-bit 3.3V PCI slots.  This important but probably rarely-noticed limitation may apply to other inexpensive cards using the SiI 3114 chipset, which means (so far as I can tell) that they don't comply with the PCI 2.2 spec.  (Nonetheless, my product's box claims that it does.)  For more details and examples of frustrated SiI 3114 users, see the debian-user list:

http://lists.debian.org/debian-user/2009/01/threads.html#00202
Comment 20 Tejun Heo 2009-01-21 05:24:12 UTC
Yeah, I read and wrote to the thread.  Heh... this problem is getting more interesting.  Aaron, another user reported that the corruption occurred in the higher sectors when doing write/read testing with the 0x00 pattern.  Can you please try to run badblocks on the disk and see which pattern fails?  Also, can you please try to determine whether the corruption occurs during read or write?  You can test it with "dd if=some-known-file of=/dev/sdXn bs=1M count=some_mega_bytes" and then "dd if=/dev/sdXn bs=1M count=same_mega_bytes | md5sum".  If you repeat the second part multiple times and the md5sums are consistent but different from the original one, write is corrupting.  If you see inconsistent md5sums, read and maybe write are corrupting.
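
Spelled out as a sketch (placeholders only: /dev/sdXn is a scratch partition whose contents are destroyed, testfile is a known file, and N is its size in whole mebibytes):

# reference checksum of the first N MiB of the known file
dd if=testfile bs=1M count=N 2>/dev/null | md5sum
# write the same data to the scratch partition
dd if=testfile of=/dev/sdXn bs=1M count=N conv=fsync
# read it back several times, dropping the page cache between runs
for i in 1 2 3 4 5; do
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/dev/sdXn bs=1M count=N 2>/dev/null | md5sum
done
# read-back sums consistent but different from the reference -> the write path is corrupting
# read-back sums varying from run to run                     -> the read path (and maybe writes too) is corrupting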

Thanks.
Comment 21 Aaron Greenspan 2009-01-21 10:36:15 UTC
Sorry, Tejun, I returned the cards to the retailer where I purchased them. However, it's clear just from anaconda's failures that data corruption was taking place immediately on write. I don't know about read, because I could never get anything written to the drive to be read in the first place.

Another important test would be to see how supposedly-PCI 2.2 compliant expansion cards such as mine behave in 5V slots versus 3.3V slots, but since I don't have any 5V slots, I couldn't try that.
Comment 22 Brett Park 2009-01-26 19:45:13 UTC
Created attachment 20004 [details]
Output of lspci
Comment 23 Brett Park 2009-01-27 17:43:29 UTC
I am currently running into this problem on Ubuntu 8.10 with 2.6.27-7-server.  I have two different SATA cards (both with the 3114 chipset) and have encountered the issue with both cards, even after updating to the newest BIOS.  The error appears to happen under heavy load, quite often when using multiple hard drives at once (I have two drives on the controller and two onboard, in a raidz (ZFS) array).  I have tried dropping one of the drives out of the array (currently on the 3114 controller) and changing to an ext3 filesystem.  It appears to be more stable, but still drops out once in a while.  I would love to help resolve this problem if I can be of any use.
Comment 24 Juris Krumins 2009-02-03 02:12:09 UTC
Seems like I'm hitting the same problem with kernel version 2.6.28.2 installed on an IBM x3650 with a ServerRAID10K RAID controller and an EXP3000 disk bay (12 SATA disks, RAID0, total LD size ~8TB).
Tested with different kernel versions, mainly with the default CentOS system kernel 2.6.18 (CentOS 5.2 is used as the base system).

Right now I'm running tests on 2.6.28 with different FS flags, partition sizes and so on.  More than that, I'm using all this with the dm-crypt module.
Comment 25 Juris Krumins 2009-02-03 03:33:33 UTC
Created attachment 20088 [details]
System dmesg

dmesg of my system
Comment 26 Juris Krumins 2009-02-03 03:34:03 UTC
Created attachment 20089 [details]
Error of my system

Error of my system
Comment 27 Juris Krumins 2009-02-03 03:34:26 UTC
Created attachment 20090 [details]
lspci info of my system

lspci info of my system
Comment 28 Juris Krumins 2009-02-03 03:35:00 UTC
Created attachment 20091 [details]
partitioning information

partitioning information
Comment 29 Brett Park 2009-02-03 10:45:14 UTC
I tried compiling my own kernel following the Ubuntu kernel guide (using Ubuntu sources) and things seem much more stable.  The original kernel was 2.6.27-7 and the new one is 2.6.27-11.  Will continue testing.
Comment 30 Tejun Heo 2009-02-03 17:59:05 UTC
Brett, can you post kernel logs?  You're more likely to be seeing a different problem.

Juris, you're not using sata_sil according to the posted log or lspci.
Comment 31 Brett Park 2009-02-03 18:10:47 UTC
Created attachment 20101 [details]
Examples of exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Comment 32 Brett Park 2009-02-03 18:12:32 UTC
Created attachment 20102 [details]
Syslog containing EXT3-fs error
Comment 33 Tejun Heo 2009-02-03 18:20:12 UTC
Brett, you're most likely seeing a power problem or disk failure.  I lean toward the former, though.  Those PHYRDY_CHGs and the following filesystem errors strongly suggest that the device is briefly losing power and forgetting what's in its write buffer.  One way to verify is to run "smartctl -a /dev/sdX" at boot, run it again after such a failure, and see whether the unload, emergency unload or start/stop counts have increased.  Another, more physical, way is to keep an ear on the drive and listen for the unloading clunking sound on error.  Or you can prepare a separate power supply, move some of the drives to it, and see whether anything changes.
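
A sketch of that check (/dev/sdX is a placeholder for an affected drive; the exact attribute names vary by drive, but the ones below are common):

# snapshot the relevant SMART counters right after boot ...
smartctl -a /dev/sdX | grep -E 'Start_Stop_Count|Power_Cycle_Count|Load_Cycle_Count|Power-Off_Retract_Count' > smart-boot.txt
# ... and again after the PHYRDY change / EXT3 errors show up
smartctl -a /dev/sdX | grep -E 'Start_Stop_Count|Power_Cycle_Count|Load_Cycle_Count|Power-Off_Retract_Count' > smart-after.txt
diff smart-boot.txt smart-after.txt    # counters that went up point at power loss rather than the controller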
Comment 34 Brett Park 2009-02-03 19:56:27 UTC
Tejun, I have checked the SMART results, and when the errors occur the Start_Stop_Count increments, as do Power_Cycle_Count and Load_Cycle_Count.  I will try a different power supply (as all the drives are brand new).
Comment 35 Tejun Heo 2009-02-03 20:31:02 UTC
Heh... yeah, take that PSU out and burn it while chanting.
Comment 36 Juris Krumins 2009-02-03 22:57:03 UTC
I'm not using sata_sil, but I have exactly the same errors with my hardware.
Tejun, do I have to create a new bug report?
Comment 37 Tejun Heo 2009-02-03 23:14:48 UTC
Yes, please.  I want to keep this one about the data corruption seen with sata_sil on certain configurations.  Thanks.
Comment 38 Brett Park 2009-02-04 18:51:37 UTC
Tejun, just to close the loop: my issue was indeed with the power supply.  Thanks.
Comment 39 Alex Vangelion 2009-04-13 05:38:51 UTC
I have a Syba card with the 3114 chipset which has been giving me data errors.  I updated the BIOS on the card to b5403, hoping it would solve the problem.  It didn't.

Playing with 'badblocks' has been enlightening.  I'm working with it at the device level, so the filesystem isn't an issue.  The non-destructive read/write test returns no bad blocks on the small (100MB) partitions on the several disks I'm using for testing, and neither does the read-only test.  Only with the destructive write test, and only with the 0xffff pattern, is there a problem: about 20% of the blocks fail.  Random data is no problem.
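
For reference, the badblocks runs described above look roughly like this (sketch; /dev/sdX1 is a placeholder test partition, and -w destroys its contents):

badblocks -sv -n /dev/sdX1              # non-destructive read/write test: no errors
badblocks -sv /dev/sdX1                 # read-only test: no errors
badblocks -sv -w -t 0xffff /dev/sdX1    # destructive write test, fixed pattern: ~20% of blocks fail
badblocks -sv -w -t random /dev/sdX1    # destructive write test, random data: no errors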
Comment 40 Tejun Heo 2009-04-14 02:55:21 UTC
Alex, can you please post the output of "lspci -nnvvvxxx"?
Comment 41 Alex Vangelion 2009-04-14 04:34:49 UTC
Created attachment 20968 [details]
lspci -nnvvvxxx
Comment 42 Alex Vangelion 2009-04-14 04:39:48 UTC
I was hoping that the 0xffff pattern would be the only thing to fail, since I would generally not fill a whole block with 0xffff in real usage.  But I tried copying over some media files and checking their md5sums, and the failure rate seemed to go up with file size.  Is there a good way to find out how two binary files differ?
Comment 43 Tejun Heo 2009-04-14 07:06:54 UTC
"cmp -b -l" should show all mismatches.  Okay, you're on via.  Arghh... still have no idea what the hell is different.  :-(
Comment 44 Alex Vangelion 2009-04-14 07:31:42 UTC
Well, there is a Via SATA controller on the motherboard (which works fine, BTW), but my problem is with the disks attached to the SATA PCI card (using sata_sil).

Also, I can't find it again, but I think I read something about this bugtracker not being for distribution kernels.  I'm running Fedora10 with [an up to date] stock kernel.  Does that mean this is not the right forum for me?
Comment 45 Tejun Heo 2009-04-14 07:34:17 UTC
(In reply to comment #44)
> Well, there is a Via SATA controller on the motherboard (which works fine,
> BTW), but my problem is with the disks attached to the SATA PCI card (using
> sata_sil).

The sil data corruption problem seems to be PCI bridge dependent.  Older NVIDIA and VIA chipsets seem to be affected.

> Also, I can't find it again, but I think I read something about this
> bugtracker
> not being for distribution kernels.  I'm running Fedora10 with [an up to
> date]
> stock kernel.  Does that mean this is not the right forum for me?

Well, sata_sil is pretty much the same between F10 and the current upstream.  It would be nice if you could verify the current upstream kernel (2.6.29), but I don't think it will behave differently.  :-(
Comment 46 Alex Vangelion 2009-04-14 08:31:03 UTC
I installed the 2.6.29.1-68 kernel from Rawhide, and yes, I see the same behavior.  (Does it count as the same kernel if the version number matches?)
Also, I did a compare on a 350 MB file I copied to the drives (md raid0, ext2).  I'm not sure I'm interpreting this correctly, but I think it is saying that the differing bytes were indeed 0xff.  I was hoping the problem only came up if an entire block was filled with 0xff.

 36812849 377 M-^? 347 M-g
 49594369 377 M-^? 357 M-o
119762945 377 M-^? 347 M-g         
313778177 377 M-^? 347 M-g
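
For reference, cmp -l -b output like the above reads as byte offset, then the two differing byte values in octal (with a printable rendering); octal 377 is indeed 0xff.  The comparison itself would look something like this (sketch; file names are placeholders):

cmp -l -b original-media-file copy-read-back-from-sil3114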
Comment 47 grabben.kernel 2009-04-14 18:43:50 UTC
Same problem, old Tyan dual PIII motherboard with 440BX (I think) chipset. Ubuntu kernel 2.6.27-11-generic. "Noname" sil3114. Will attach lspci -nnvvvxxx.
Comment 48 grabben.kernel 2009-04-14 18:45:49 UTC
Created attachment 20980 [details]
lspci -nnvvvxxx

lspci -nnvvvxxx
Comment 49 Tejun Heo 2009-04-23 03:44:32 UTC
Can you guys please post lspci -nnvvvxxx output after the data corruption problem has occurred?  Let's see if the PERR status bit is set.
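
A quick way to check just that bit (sketch; 00:12.0 is the controller's slot from comment #11, adjust for your system):

lspci -vvv -s 00:12.0 | grep 'Status:'
# "<PERR+" in the Status line means a detected PCI parity error; "<PERR-" means none was detected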

Thanks.
Comment 50 Alex Vangelion 2009-04-24 22:51:44 UTC
Created attachment 21108 [details]
"lspci -nnvvvxxx" output after data write failure

Output of "lspci -nnvvvxxx" right after write failure using "badblocks -w -t 0xff".  All sections read "<PERR-".
Comment 51 Roland Kletzing 2009-04-25 19:48:15 UTC
I have built a Solaris nv_110 (OpenSolaris Nevada build, sort of a "bleeding edge" build) P4-based box (FSC Celsius) with 4 Seagate Barracuda ES.2 drives and did some burn-in testing yesterday.

After writing around 250GB to the array, ZFS reported corruption on 2 disks (1 was degraded, 1 faulted).
I was not yet able to reproduce it, but I found this thread: http://lists.debian.org/debian-user/2009/01/msg00202.html and this open bug ticket.

So, either my hardware setup is not optimal, or the Solaris driver suffers from problems similar to the Linux one's.

Please have a look at http://markmail.org/message/wbmngtspkqogwap5
If useful, here is a link to the Solaris driver source: http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/io/dktp/controller/ata/sil3xxx.c

BTW, Solaris uses the sil3114 in IDE mode only, so I needed to replace the RAID BIOS with the IDE BIOS.

I will do more testing, will also try accessing the same disks from Linux (to see how it behaves there), and will post results here if welcome.
Comment 52 Tejun Heo 2009-04-26 01:56:22 UTC
Created attachment 21122 [details]
sil-dbg.patch

Can you please apply the attached patch and post boot log?
Comment 53 Richard Huddleston 2009-06-04 21:07:27 UTC
I'm seeing a similar issue with the SiI 3132 controller on Ubuntu 9.04 server with the Ubuntu mainline kernel 2.6.29-02062904-generic x86_64.

XFS on top of md RAID5 with 5 drives.

Silent data corruption when reading back the same file... different md5sums each time!

Will attach system info.
Comment 54 Richard Huddleston 2009-06-04 21:09:17 UTC
Created attachment 21756 [details]
SiI 3132 corruption on XFS
Comment 55 Tejun Heo 2009-06-05 01:26:03 UTC
Hmm... it's the first time someone reports data corruption on sil3124/32.  They're completely different controllers.  Can you please file a separate bug report?
Comment 56 Richard Huddleston 2009-06-08 18:04:15 UTC
SiI 3132 corruption on XFS... well, it turns out to be a memory caching issue: reading directly from disk (clearing out the memory cache) always returns the correct result.  Memtest hasn't found an issue yet, so I'm still looking for the cause.
Comment 57 grabben.kernel 2009-07-20 07:40:08 UTC
Output from patch in #52 is the following for me:
[    4.669833] sata_sil 0000:00:13.0: version 2.4
[    4.670080] sata_sil 0000:00:13.0: Applying R_ERR on DMA activate FIS errata fix
[    4.670094] sata_sil 0000:00:13.0: XXX SIL_SYSCFG=0x0

2.6.29.4

/Grabben
Comment 58 Tejun Heo 2009-07-23 10:31:28 UTC
It's been a while, but I think I was curious about the M66EN bit.  Nothing is set in the register.  Unfortunately, I can't reproduce the problem and I'm not really sure what makes the FreeBSD driver avoid it; the differences I noticed didn't actually make any difference, so I'm out of ideas at the moment.  Ergh....
Comment 59 Pascal Vandeputte 2009-09-02 14:22:09 UTC
I bought 2 identical SiI3114 cards each coupled with 2 identical Western Digital WD10EADS drives. One of them is in my home router (an old Asus P3B-F board, 440BX I believe) and works perfectly. The other one is in a test machine with an MSI K7T Turbo2 (VIA KT133A chipset) and I immediately experienced weird issues with it:

- the DOS-based (!) firmware flash tool from SI couldn't detect the type of Flash chip on the card; specifying it manually also gave an error. I put the card in a different PC and there the tool could do everything automatically
- when using smartctl -A on one of the attached drives, I always see the error message "Warning! Drive Identity Structure error: invalid SMART checksum." at the top of the output.
- naturally, I also experience the EXT3 corruption this bug is about or I wouldn't be here

Both machines run the same Debian Lenny release with 2.6.26 kernel.

I'll attach as much information as possible.

I'm beginning to wonder which cheap SATA add-on board is actually guaranteed to work with most OSes. I've been burnt by a marvell-based controller earlier, had fairly good experiences with SiI 3124 on Solaris (though rather slow) and now tried the cheaper 3114 on Linux because performance wasn't a concern and I hoped it would be similar to the 3124 experience.
Comment 60 Pascal Vandeputte 2009-09-02 14:37:29 UTC
Created attachment 22976 [details]
lspci info and /proc/interrupts on 440BX, working perfectly
Comment 61 Pascal Vandeputte 2009-09-02 14:38:10 UTC
Created attachment 22977 [details]
lspci info and /proc/interrupts on KT133A, broken
Comment 62 Pascal Vandeputte 2009-09-18 11:30:29 UTC
Got myself a brand new Dawicontrol DC-4300 with the SiI3124-2 chip, pulled the "BIOS enable" jumper so it works as a dumb disk controller, and everything is working perfectly now.

I did have to switch PCI slots though; the screen remained black the first time.  Long live PCIe.
Comment 63 Dan Rose 2010-02-20 09:19:31 UTC
I too have this problem, I think.  I get a different MD5sum on .iso files every time I try.

The system is a P4 on a VIA chipset board with a VT8233 southbridge and an unknown northbridge (under a heatsink).

I'm running a redhat/CentOS kernel 2.6.9-89.0.16.EL.

I have reflashed the card with the b5403, b5505 and r5403 BIOSes to no avail, and have also tried adding slow_down=1 as an option for the module.

The latter results in this additional string being logged:

sata_sil 0000:00:0b.0: Applying R_ERR on DMA activate FIS errata fix

Have there been any solid workarounds discovered since September last year?  This is driving me crazy!
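
For anyone else trying the same thing, the option is passed like any other module parameter (sketch, assuming sata_sil is built as a module and its disks are not currently in use):

modprobe -r sata_sil
modprobe sata_sil slow_down=1
# or persistently, with a line "options sata_sil slow_down=1" in /etc/modprobe.conf (or a file under /etc/modprobe.d/)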
Comment 64 Tejun Heo 2010-02-23 01:56:34 UTC
Unfortunately, there is currently no known remedy for the problem.  There was a report that tweaking BIOS PCI bus options removes the problem.  It seems that these SIMG chips are sensitive to PCI signal quality and susceptible to data corruption when certain conditions are met.  Bus loading definitely seems to contribute, considering that multiple reports are on configurations with more than one controller.  The PCI host controller also seems to be a significant factor; most cases are on VIA chipsets.

I'm afraid at this point I don't have much idea what to do.  The only thing I can think of is moving the controller around to different slots and/or removing other controllers and seeing whether anything changes.

Thanks.
