Bug 63981 - Bad: Buffer I/O errors make disk unusable
Summary: Bad: Buffer I/O errors make disk unusable
Status: RESOLVED INVALID
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Serial ATA
Hardware: x86-64 Linux
Importance: P1 high
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-10-28 19:38 UTC by Giuseppe Scalzi
Modified: 2013-10-29 15:14 UTC
CC List: 1 user

See Also:
Kernel Version: 3.12.0-rc6
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg of the errors causing the problem (19.56 KB, application/octet-stream)
2013-10-28 19:38 UTC, Giuseppe Scalzi

Description Giuseppe Scalzi 2013-10-28 19:38:44 UTC
Created attachment 112591
dmesg of the errors causing the problem

While I am using my laptop, the SSD suddenly becomes unusable. The disk is remounted in read-only mode, and the only way to get it working again is to reboot.
During the reboot, the file system check fixes the errors and I can use the laptop for a few hours, after which the problem appears again.
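
A minimal sketch, not part of the original report, of how the read-only state described above could be confirmed from a script. It assumes the kernel lists the remounted file system with an "ro" option in /proc/mounts; the function name read_only_mounts is hypothetical.

```python
def read_only_mounts(mounts_text, fstypes=("ext4",)):
    """Return (device, mountpoint) pairs whose mount options include 'ro'.

    mounts_text is expected to be in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        if fstype in fstypes and "ro" in options.split(","):
            found.append((device, mountpoint))
    return found
```

On an affected system it could be fed the live mount table, for example read_only_mounts(open("/proc/mounts").read()).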

This problem is difficult to reproduce because there are no precise steps that trigger the I/O errors shown in the attached dmesg.

I had the same problem with kernels 3.11.5 and 3.10.6. I use a Sony VAIO Pro (Sony Corporation SVP1321C5E/VAIO, BIOS R1040V7 09/09/2013).

============================

Information about my system:

bash-4.2# cat /proc/scsi/scsi 
Attached devices:
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: SAMSUNG MZNTD256 Rev: DXT2
  Type:   Direct-Access                    ANSI  SCSI revision: 05

========================================

/etc/fstab

/dev/sda1        /                ext4        defaults         1   1
/dev/sda2        /home/      ext4        defaults         1   2
/dev/sda3        /media/hd1       ext4        defaults         1   2
#/dev/cdrom      /mnt/cdrom       auto        noauto,owner,ro,comment=x-gvfs-show 0   0
/dev/fd0         /mnt/floppy      auto        noauto,owner     0   0
devpts           /dev/pts         devpts      gid=5,mode=620   0   0
proc             /proc            proc        defaults         0   0
tmpfs            /dev/shm         tmpfs       defaults         0   0
tmpfs            /tmp             tmpfs defaults,noatime,nodiratime,mode=1777  0   0
tmpfs            /var/spool       tmpfs defaults,noatime,nodiratime,mode=1777  0   0
tmpfs            /var/tmp         tmpfs defaults,noatime,nodiratime,mode=1777  0   0

/proc/version

Linux version 3.12.0-rc6 (root@darkstar) (gcc version 4.8.1 (GCC) ) #1 SMP Sun Oct 27 19:02:16 CET 2013


Attached you will find the relevant part of dmesg.

Thanks for your help.
Comment 1 Theodore Tso 2013-10-28 22:35:09 UTC
From the errors listed in the dmesg, looks like it is a hardware problem with the SSD, not an ext4 bug.

I'd suggest doing a full backup of your disk while you still can, and try replacing the SSD....
Comment 2 Giuseppe Scalzi 2013-10-28 23:01:07 UTC
(In reply to Theodore Tso from comment #1)
> From the errors listed in the dmesg, looks like it is a hardware problem
> with the SSD, not an ext4 bug.
> 
> I'd suggest doing a full backup of your disk while you still can, and try
> replacing the SSD....

That's strange, because I bought the laptop two weeks ago and for one week I used Windows and everything worked fine. I have had this problem since the first day after installing Linux.

Is it possible to check if there are some hardware errors from smartctl?

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG MZNTD256HAGL-00000
Serial Number:    S15ZNYAD730814
LU WWN Device Id: 5 002538 5000648f8
Firmware Version: DXT2300Q
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4c
Local Time is:    Mon Oct 28 23:47:51 2013 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (53956) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  40) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       118
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       143
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       1
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   061   030   000    Old_age   Always       -       39
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       52
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       851312137

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
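
As a rough sketch for the question above, the attribute table can be checked mechanically. The helper names below are hypothetical, and the rule used (an attribute counts as failed when its normalized VALUE is at or below a non-zero THRESH) is the usual smartmontools convention rather than anything specific to this drive.

```python
def parse_smart_attributes(text):
    """Parse the attribute rows of `smartctl -A` style output into dicts."""
    attrs = []
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "type": fields[6],
            "when_failed": fields[8],
            "raw": " ".join(fields[9:]),
        })
    return attrs


def failing_attributes(attrs):
    """Names of attributes whose normalized value has reached the threshold."""
    return [a["name"] for a in attrs
            if a["thresh"] > 0 and a["value"] <= a["thresh"]]
```

Applied to the table above, failing_attributes returns an empty list, which agrees with the PASSED self-assessment. A clean SMART report does not by itself rule out problems on the link or in the controller.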
Comment 3 Theodore Tso 2013-10-28 23:20:20 UTC
It's possible there is some kind of compatibility issue with the SATA driver on your Sony VAIO, but the point is that with errors like these:

[13546.661310] ata4.00: failed command: WRITE FPDMA QUEUED
[13546.661315] ata4.00: cmd 61/08:00:2f:1d:0a/00:00:00:00:00/40 tag 0 ncq 4096 out
[13546.661315]          res 40/00:01:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[13546.661318] ata4.00: status: { DRDY }
[13546.661319] ata4.00: failed command: WRITE FPDMA QUEUED
[13546.661323] ata4.00: cmd 61/08:08:27:1d:0a/00:00:00:00:00/40 tag 1 ncq 4096 out
[13546.661323]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[13546.661325] ata4.00: status: { DRDY }

... and like these:

[13606.886264] EXT4-fs warning (device sda1): ext4_end_bio:316: I/O error writing to inode 2097393 (offset 0 size 0 starting block 82853)
[13606.886267] Buffer I/O error on device sda1, logical block 82845
[13606.886268] sd 3:0:0:0: [sda] Unhandled error code
[13606.886270] sd 3:0:0:0: [sda]  
[13606.886271] Result: hostbyte=0x04 driverbyte=0x00
[13606.886273] sd 3:0:0:0: [sda] CDB: 
[13606.886274] cdb[0]=0x2a: 2a 00 00 00 00 3f 00 00 08 00
[13606.886282] sd 3:0:0:0: [sda] Unhandled error code
[13606.886283] Buffer I/O error on device sda1, logical block 0
[13606.886285] lost page write due to I/O error on sda1
[13606.886288] sd 3:0:0:0: [sda]  
[13606.886289] Result: hostbyte=0x04 driverbyte=0x00
[13606.886293] sd 3:0:0:0: [sda] CDB: 
[13606.886294] EXT4-fs error (device sda1): ext4_journal_check_start:56: 
[13606.886294] cdb[0]=0x2a: 2a 00

... there's little that we can do at the ext4 level.  Basically, the disk device (or the Sony VAIO's SATA chipset) is refusing to talk to Linux.

The Sony VAIO has, historically, been notorious for using Windows-specific hardware that doesn't work well with Linux.  I don't know anything about your specific model, but there have been enough problems in the past that I avoid Sony laptops like the plague if I intend to use Linux on them.  It's not by accident that most Linux kernel developers tend to use Lenovo Thinkpads...
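
For readers who want to see what the log lines quoted in this comment encode, here is a small decoding sketch. It is an illustration added for clarity, not something posted in the thread. It assumes the libata "cmd" field order command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, the NCQ convention that the sector count travels in the feature registers with the tag in the sector count register, and the standard 10-byte WRITE(10) layout for the SCSI CDB. The function names are hypothetical.

```python
import re

SECTOR_SIZE = 512  # logical sector size reported by smartctl in comment 2


def decode_ncq_cmd(line):
    """Decode the 'cmd' taskfile of a libata error line for an NCQ command."""
    match = re.search(r"cmd ([0-9a-f:/]+)", line)
    if match is None:
        raise ValueError("no 'cmd' taskfile found in line")
    command, low, high, _device = match.group(1).split("/")
    command = int(command, 16)
    if command not in (0x60, 0x61):  # READ / WRITE FPDMA QUEUED
        raise ValueError("not an NCQ read/write command")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    # A sector count of 0 means 65536 sectors.
    sectors = ((hob_feature << 8) | feature) or 65536
    # Assemble the 48-bit LBA, most significant byte first.
    lba = 0
    for byte in (hob_lbah, hob_lbam, hob_lbal, lbah, lbam, lbal):
        lba = (lba << 8) | byte
    return {
        "command": command,
        "tag": nsect >> 3,  # NCQ tag lives in bits 7:3
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * SECTOR_SIZE,
    }


def decode_write10(cdb_hex):
    """Decode LBA and transfer length from a SCSI WRITE(10) CDB hex dump."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    return {
        "lba": int.from_bytes(cdb[2:6], "big"),
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }
```

Decoded this way, the first failed command is an 8-sector (4096-byte) write with tag 0 at LBA 662831, matching the "tag 0 ncq 4096 out" suffix in the log. The complete CDB is a write of 8 blocks at LBA 63. Both are ordinary small writes that timed out or were rejected below the file system, which is consistent with the point made above.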
Comment 4 Theodore Tso 2013-10-28 23:31:31 UTC
BTW, I'm using a 512GB Samsung 840 PRO (2.5" SATA SSD) and a 240GB Intel 525 SSD (mSATA) on my Lenovo T430s, and they both work like a charm.

Hmm... I wasn't able to get detailed specs on your SAMSUNG MZNTD256HAGL-00000, but upon doing some further research, it appears to be a new-fangled M.2 PCIe interface.  So it's not a mSATA nor a 2.5" SATA interface, but Something New.

So whether or not this is a Linux bug, or an implementation bug in this new Samsung part (or a failure in the standardization of this new M.2 PCIe interface), I can't say, but this looks like the most likely cause is a problem with this new SSD or its new M.2 interface[1].

[1] http://en.wikipedia.org/wiki/Next_Generation_Form_Factor
Comment 5 Giuseppe Scalzi 2013-10-29 08:28:48 UTC
(In reply to Theodore Tso from comment #4)
> BTW, I'm using a 512GB Samsung 840 PRO (2.5" SATA SSD) and an 240GB Intel
> 525 SSD (mSata) on my Lenovo T430s, and they both work like a charm.
> 
> Hmm... I wasn't able to get detailed specs on your SAMSUNG
> MZNTD256HAGL-00000, but upon doing some further research, it appears to be a
> new-fangled M.2 PCIe interface.  So it's not a mSATA nor a 2.5" SATA
> interface, but Something New.
> 
> So whether or not this is a Linux bug, or an implementation bug in this new
> Samsung part (or a failure in the standardization of this new M.2 PCIe
> interface), I can't say, but this looks like the most likely cause is a
> problem with this new SSD or its new M.2 interface[1].
> 
> [1] http://en.wikipedia.org/wiki/Next_Generation_Form_Factor

OK, thank you for your reply. I understand that this isn't a problem related to ext4.

I noticed that the Arch Linux wiki page for my laptop model (https://wiki.archlinux.org/index.php/Sony_Vaio_Pro_SVP-1x21) suggests using this option:

- When booting from USB, append libata.force=noncq to the kernel parameters to avoid problems with the SSD.

Well they say "when booting from USB" but I'll try "libata.force=noncq" anyway.

We will see what happens.
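
A small sketch, added for illustration, of how one might verify after rebooting that the option actually reached the kernel. It assumes the documented libata.force format, a comma-separated list of entries of the form VAL or ID:VAL; the function names are hypothetical.

```python
def libata_force_values(cmdline):
    """Return the libata.force entries found on a kernel command line."""
    values = []
    for token in cmdline.split():
        if token.startswith("libata.force="):
            entries = token.split("=", 1)[1].split(",")
            values.extend(v for v in entries if v)
    return values


def ncq_disabled(cmdline):
    """True if any libata.force entry requests 'noncq'."""
    # Each entry is either VAL or ID:VAL, e.g. "noncq" or "4.00:noncq".
    return any(v.split(":")[-1] == "noncq"
               for v in libata_force_values(cmdline))
```

On the running system the input would be the contents of /proc/cmdline, for example ncq_disabled(open("/proc/cmdline").read()). This only shows that the parameter was passed; whether it avoids the timeouts still has to be observed in practice.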
