Bug 202899 - Bogus hard drive IO errors on certain hardware
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: Other
Hardware: Intel Linux
Importance: P1 high
Assignee: io_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-03-13 08:22 UTC by bitjam
Modified: 2019-03-25 10:05 UTC
CC List: 2 users

See Also:
Kernel Version: 4.15.0 -- 4.20.12
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg output from first user (6.85 KB, text/plain)
2019-03-13 08:22 UTC, bitjam
lspci -k output (7.10 KB, text/plain)
2019-03-13 08:24 UTC, bitjam
dmesg output from 2nd user (2.37 KB, text/plain)
2019-03-13 08:25 UTC, bitjam
dmesg output from kernel 4.14 (57.66 KB, text/plain)
2019-03-25 10:01 UTC, Gorštak
dmesg output from kernel 5.0 (49.78 KB, text/plain)
2019-03-25 10:02 UTC, Gorštak
lspci output (70.05 KB, text/plain)
2019-03-25 10:05 UTC, Gorštak

Description bitjam 2019-03-13 08:22:14 UTC
Created attachment 281791
dmesg output from first user

Latest working kernel: 4.14.14
Earliest failing kernel: 4.15.0
Distro: MX Linux

We've had reports from two users who got IO errors on certain devices with MX Linux but not with other distros.  One user did a bisection search on the MX kernels and antiX kernels.  All kernels from 4.15.0 onward produce the IO errors, which prevent the device and/or partition from being read.

The problem was reported in this thread:
https://forum.mxlinux.org/viewtopic.php?p=490169#p490169

Another user reported a very similar problem but gave up in frustration.  Their problem was with a "7.2G TDK LoR TF10" USB stick.  Other distros could access the stick without a problem.  I'm not certain it is the same problem, but it seems similar.  In any event, the problem is rare but repeatable.  It goes away when using earlier kernels.
Comment 1 bitjam 2019-03-13 08:24:34 UTC
Created attachment 281793
lspci -k output
Comment 2 bitjam 2019-03-13 08:25:06 UTC
Created attachment 281795
dmesg output from 2nd user
Comment 3 Bjorn Helgaas 2019-03-13 14:11:45 UTC
No indication yet that this is PCI-related, so reassigning.

I guess you've already tested v5.0?  If not, please test that to make sure it hasn't been fixed already.

Thanks for narrowing this down to something that changed between v4.14.14 and v4.15.0.  Since it's apparently repeatable, it would be possible for someone who can easily reproduce the problem to bisect it to a specific commit.  See https://www.kernel.org/doc/html/v4.14/admin-guide/bug-bisect.html

If available, please attach complete dmesg logs (not just extracts) and "sudo lspci -vvxxx" output from the entire system.
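
For anyone unfamiliar with bisection, the idea is a binary search over an ordered history: test the midpoint, then keep whichever half still contains the good-to-bad transition. Below is a minimal sketch in Python, not the actual git bisect implementation. The names `first_bad`, `candidates`, and `fails_in_report` are hypothetical, and the stand-in test simply encodes the results reported in this bug (4.14.14 works; 4.15.0, 4.20.12, and 5.0 fail). In practice, "testing" a candidate means building and booting that kernel and trying to read the affected device.

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest failing entry in an ordered history.

    Assumes versions are ordered oldest to newest, the first entry is known
    good, the last entry is known bad, and every entry after the first bad
    one is also bad (the same assumption a commit bisection relies on).
    """
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(versions[mid]):
            bad = mid   # transition is at or before mid
        else:
            good = mid  # transition is after mid
    return versions[bad]


# Kernel versions mentioned in this report, oldest to newest.
candidates = ["4.14.14", "4.15.0", "4.20.12", "5.0"]


def fails_in_report(version: str) -> bool:
    # Hypothetical stand-in for "boot this kernel and try to read the device".
    return tuple(int(part) for part in version.split(".")) >= (4, 15, 0)


print(first_bad(candidates, fails_in_report))
```

A commit-level bisection works the same way, except that the history is the set of commits between the good and bad kernels, so each step needs a kernel build and a reboot.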
Comment 4 Gorštak 2019-03-25 10:00:16 UTC
No luck with 5.0 either; same results.
Comment 5 Gorštak 2019-03-25 10:01:41 UTC
Created attachment 282001
dmesg output from kernel 4.14
Comment 6 Gorštak 2019-03-25 10:02:17 UTC
Created attachment 282003
dmesg output from kernel 5.0
Comment 7 Gorštak 2019-03-25 10:05:12 UTC
Created attachment 282005
lspci output
