Bug 219300 - ext4 corrupts data on a specific pendrive
Status: RESOLVED INVALID
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: All Linux
Importance: P3 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
Reported: 2024-09-22 15:47 UTC by nxe9
Modified: 2024-09-24 16:15 UTC

Regression: No

Description nxe9 2024-09-22 15:47:48 UTC
Hi, copying data to a specific pendrive with the ext4 file system does not work correctly, i.e. the data is damaged after copying. My observations lead me to believe that this is caused by some bug in the Linux kernel. Below I will list all relevant observations.

Steps to reproduce:
1. Create an ext4 filesystem using any kernel >=5 (<5 not tested) on a specific pendrive model. Pendrive: Intenso Speed Line, idVendor=346d, idProduct=5678, 31.5 GB/29.3GiB 
2. Copy at least a few GB of data in the form of several files to the mentioned pendrive. E.g. at least five files of 1 GB each.
3. Compare the checksums of the files on the host and on the flash drive.
4. At least some files are inconsistent. If not, then unmount and remount the file system or restart your computer and check the checksums again.
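
The comparison in step 3 can be sketched as a small script. This is only an illustration, not the tool the reporter used: the function names `sha256_of` and `find_mismatches` are hypothetical, and the two directory arguments stand for the source directory on the host and the mount point of the pendrive. Because a read straight after the copy may be served from the page cache rather than from the device, it is probably best run after the unmount and remount described in step 4.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(src_dir, dst_dir):
    """Return relative paths of files that are missing from dst_dir or differ there."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    mismatched = []
    for src in sorted(p for p in src_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(src_dir)
        dst = dst_dir / rel
        # A file counts as mismatched if the copy is absent or its digest differs.
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            mismatched.append(str(rel))
    return mismatched
```

An empty list means every source file has an identical copy on the target; any other result lists the files to inspect.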

Counterexample:
1. Do the same as above, this time with ntfs instead of ext4.
2. All files are always consistent.

My observations:
- The problem occurs every time I copy at least a few GB of data.
- The problem occurs on various Linux operating systems (gentoo kernel 6.6.47, 6.6.38, arch kernel 5.x, arch kernel 6.10.7, ubuntu 24.04 LTS kernel 6.8.0-41-generic). So I assume that the problem has been present for a long time and probably also in the latest version.
- I notice a difference between older kernels and version 6.10.7 (arch linux). In the case of 6.10.7, the problem does not occur immediately, but only after remounting the file system or restarting the computer.
- I verify the data using crc32 or sha256 checksum.
- I tested on two different machines.
- The host has been tested by memtest. There were no errors.
- The problem concerns a specific pendrive model. I have two physical pendrives of the exact same model and both of them have this problem. Other models, even from the same manufacturer, do not cause the problem. Models that cause the problem: Intenso Speed Line, idVendor=346d, idProduct=5678, 31.5 GB/29.3GiB 
- The problem is not because I unmounted the device incorrectly or removed the pendrive too quickly. 
- Below is an example of dmesg output.
- Typically, only the data gets corrupted when copied. However, sometimes the entire file system crashes. Below is an example from dmesg.
- The problem occurs in both USB 2 and USB 3 slots.
- Corrupt data is not the same every time. I.e. by copying the data twice, I get two different checksums on the flash drive. The number of corrupted files also varies.
- One might assume that the problem is the poor quality of the pendrive model, but the problem does not occur at all on ntfs. Ntfs always works fine. Both on Windows and various Linux distributions.
- Copying to ntfs takes a short time. ext4 is over 10 times slower than ntfs for this model.
- f2fs also corrupted the data, while exFAT did not. However, I have not tested these file systems extensively.
- I looked for help on gentoo forum, but they were unable to help me there. There is a discussion on this topic in the link below, but I have summarized everything important here. https://forums.gentoo.org/viewtopic-t-1170536.html

It seems that ntfs can handle this hardware correctly, but ext4 has some problem. 


Sample dmesg output during data corruption: 
[20904.194233] usb 2-4: new high-speed USB device number 3 using xhci_hcd
[20904.322059] usb 2-4: New USB device found, idVendor=346d, idProduct=5678, bcdDevice= 2.00
[20904.322076] usb 2-4: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[20904.322083] usb 2-4: Product: Intenso Speed Line
[20904.322090] usb 2-4: Manufacturer: Intenso
[20904.322094] usb 2-4: SerialNumber: FC<replaced...>
[20904.323170] usb-storage 2-4:1.0: USB Mass Storage device detected
[20904.323543] scsi host6: usb-storage 2-4:1.0
[20905.374792] scsi 6:0:0:0: Direct-Access     Intenso  Speed Line       2.00 PQ: 0 ANSI: 4
[20905.375139] sd 6:0:0:0: Attached scsi generic sg1 type 0
[20905.376508] sd 6:0:0:0: [sdb] 61440000 512-byte logical blocks: (31.5 GB/29.3 GiB)
[20905.376780] sd 6:0:0:0: [sdb] Write Protect is off
[20905.376786] sd 6:0:0:0: [sdb] Mode Sense: 03 00 00 00
[20905.376921] sd 6:0:0:0: [sdb] No Caching mode page found
[20905.376924] sd 6:0:0:0: [sdb] Assuming drive cache: write through
[20905.389018]  sdb: sdb1
[20905.389331] sd 6:0:0:0: [sdb] Attached SCSI removable disk
[20931.947073]  sdb: sdb1
[20931.969695]  sdb: sdb1
[20977.720825] EXT4-fs (sdb1): mounted filesystem 28b0a704-e5b8-4dee-aab5-316b73b481a4 r/w with ordered data mode. Quota mode: none.
[21159.649524] usb 2-4: reset high-speed USB device number 3 using xhci_hcd
[21165.264260] usb 2-4: device descriptor read/64, error -110
[21329.633511] usb 2-4: reset high-speed USB device number 3 using xhci_hcd
[21335.248339] usb 2-4: device descriptor read/64, error -110
[21506.786497] usb 2-4: reset high-speed USB device number 3 using xhci_hcd
[21512.400341] usb 2-4: device descriptor read/64, error -110
[21858.529543] usb 2-4: reset high-speed USB device number 3 using xhci_hcd
[21864.144299] usb 2-4: device descriptor read/64, error -110
[22010.598453] usb 2-4: reset high-speed USB device number 3 using xhci_hcd
[22016.209332] usb 2-4: device descriptor read/64, error -110
[22402.785528] usb 2-4: reset high-speed USB device number 3 using xhci_hcd
[22408.400301] usb 2-4: device descriptor read/64, error -110
[22542.562424] usb 2-4: reset high-speed USB device number 3 using xhci_hcd
[22548.176319] usb 2-4: device descriptor read/64, error -110
[22658.273592] usb 2-4: reset high-speed USB device number 3 using xhci_hcd
[22663.888296] usb 2-4: device descriptor read/64, error -110
...
[23482.082529] usb 2-4: reset high-speed USB device number 3 using xhci_hcd
[23487.697333] usb 2-4: device descriptor read/64, error -110
[23776.993521] usb 2-4: reset high-speed USB device number 3 using xhci_hcd
[23782.608362] usb 2-4: device descriptor read/64, error -110


Another sample dmesg output during data corruption: 
[31547.744532] usb 4-4: new SuperSpeed USB device number 2 using xhci_hcd
[31547.757338] usb 4-4: LPM exit latency is zeroed, disabling LPM.
[31547.758379] usb 4-4: New USB device found, idVendor=346d, idProduct=5678, bcdDevice= 2.00
[31547.758390] usb 4-4: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[31547.758394] usb 4-4: Product: Intenso Speed Line
[31547.758398] usb 4-4: Manufacturer: Intenso
[31547.758401] usb 4-4: SerialNumber: FC<replaced...>
[31547.759224] usb-storage 4-4:1.0: USB Mass Storage device detected
[31547.759634] scsi host6: usb-storage 4-4:1.0
[31548.766861] scsi 6:0:0:0: Direct-Access     Intenso  Speed Line       2.00 PQ: 0 ANSI: 4
[31548.767176] sd 6:0:0:0: Attached scsi generic sg1 type 0
[31548.768220] sd 6:0:0:0: [sdb] 61440000 512-byte logical blocks: (31.5 GB/29.3 GiB)
[31548.768375] sd 6:0:0:0: [sdb] Write Protect is off
[31548.768380] sd 6:0:0:0: [sdb] Mode Sense: 03 00 00 00
[31548.768503] sd 6:0:0:0: [sdb] No Caching mode page found
[31548.768507] sd 6:0:0:0: [sdb] Assuming drive cache: write through
[31548.777093] Alternate GPT is invalid, using primary GPT.
[31548.777106]  sdb: sdb1
[31548.777407] sd 6:0:0:0: [sdb] Attached SCSI removable disk
[31562.696274]  sdb: sdb1
[33457.510450]  sdb: sdb1
[33457.532279]  sdb: sdb1
[33555.769208] EXT4-fs (sdb1): mounted filesystem 6660ff4e-c384-405c-be0e-86737a393344 r/w with ordered data mode. Quota mode: none.
[33986.273861] usb 4-4: reset SuperSpeed USB device number 2 using xhci_hcd
[33987.553302] usb 4-4: LPM exit latency is zeroed, disabling LPM.
[34132.705880] usb 4-4: reset SuperSpeed USB device number 2 using xhci_hcd
[34133.691058] usb 4-4: LPM exit latency is zeroed, disabling LPM.
[34734.306884] usb 4-4: reset SuperSpeed USB device number 2 using xhci_hcd
[34735.012621] usb 4-4: LPM exit latency is zeroed, disabling LPM.
[34769.121882] usb 4-4: reset SuperSpeed USB device number 2 using xhci_hcd
[34769.838692] usb 4-4: LPM exit latency is zeroed, disabling LPM.
[35411.681919] usb 4-4: reset SuperSpeed USB device number 2 using xhci_hcd
[35411.771220] usb 4-4: LPM exit latency is zeroed, disabling LPM.
[35447.009831] usb 4-4: reset SuperSpeed USB device number 2 using xhci_hcd
[35447.944211] usb 4-4: LPM exit latency is zeroed, disabling LPM.
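
Editorial sketch: the reset pattern in the two excerpts above can be summarized with a short parser. The names `RESET_RE`, `reset_timestamps` and `reset_intervals` are hypothetical, and the regular expression is only fitted to the line format shown in this report. For reference, error -110 in the first excerpt is ETIMEDOUT on Linux.

```python
import re

# Matches lines such as:
# [21159.649524] usb 2-4: reset high-speed USB device number 3 using xhci_hcd
RESET_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+usb \S+: reset (?:high-speed|SuperSpeed) USB device"
)


def reset_timestamps(dmesg_text):
    """Return the kernel timestamps (in seconds) of every USB device reset line."""
    stamps = []
    for line in dmesg_text.splitlines():
        match = RESET_RE.match(line.strip())
        if match:
            stamps.append(float(match.group(1)))
    return stamps


def reset_intervals(stamps):
    """Return the gaps in seconds between consecutive resets."""
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]
```

Feeding the output of `dmesg` to `reset_timestamps` gives a quick count of how often the device was reset during a copy.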


Sample console/dmesg output when the entire filesystem is corrupted:
cp: error writing '<replaced...>': Input/output error 
cp: cannot create regular file '<replaced...>': Read-only file system 
cp: cannot create regular file '<replaced...>': Read-only file system 
cp: cannot create regular file '<replaced...>': Read-only file system 
  
... 
  
[ 8202.825924] EXT4-fs (sdb1): mounted filesystem 84c42b25-807a-494f-a8de-bbb280c21d38 r/w with ordered data mode. Quota mode: none. 
[ 8207.481253] EXT4-fs error (device sdb1): ext4_validate_block_bitmap:421: comm ext4lazyinit: bg 176: bad block bitmap checksum 
[ 8228.651866] EXT4-fs (sdb1): unmounting filesystem 84c42b25-807a-494f-a8de-bbb280c21d38. 
[ 8237.434827] EXT4-fs (sdb1): warning: mounting fs with errors, running e2fsck is recommended 
[ 8237.435636] EXT4-fs (sdb1): mounted filesystem 84c42b25-807a-494f-a8de-bbb280c21d38 r/w with ordered data mode. Quota mode: none. 
[ 8238.993344] EXT4-fs error (device sdb1): ext4_validate_block_bitmap:421: comm ext4lazyinit: bg 176: bad block bitmap checksum 
  
... 
  
[ 8557.663116] EXT4-fs (sdb1): error count since last fsck: 3 
[ 8557.663137] EXT4-fs (sdb1): initial error at time 1725382598: ext4_validate_block_bitmap:421 
[ 8557.663148] EXT4-fs (sdb1): last error at time 1725383358: ext4_validate_block_bitmap:421 
  
... 
  
[11843.298598] usb 2-2: reset high-speed USB device number 2 using xhci_hcd 
[11844.103802] usb 2-2: device firmware changed 
[11844.103922] usb 2-2: USB disconnect, device number 2 
[11844.111282] device offline error, dev sdb, sector 60278752 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2 
[11844.111301] EXT4-fs warning (device sdb1): ext4_end_bio:343: I/O error 17 writing to inode 20 starting block 7534844) 
[11844.111331] device offline error, dev sdb, sector 60286976 op 0x1:(WRITE) flags 0x4000 phys_seg 2 prio class 2 
[11844.111377] device offline error, dev sdb, sector 60287216 op 0x1:(WRITE) flags 0x4000 phys_seg 2 prio class 2 
[11844.111399] device offline error, dev sdb, sector 60287456 op 0x1:(WRITE) flags 0x4000 phys_seg 3 prio class 2 
[11844.111425] device offline error, dev sdb, sector 60287696 op 0x1:(WRITE) flags 0x4000 phys_seg 2 prio class 2 
[11844.111434] device offline error, dev sdb, sector 60287936 op 0x1:(WRITE) flags 0x4000 phys_seg 3 prio class 2 
[11844.111450] device offline error, dev sdb, sector 60288176 op 0x1:(WRITE) flags 0x4000 phys_seg 2 prio class 2 
[11844.111460] device offline error, dev sdb, sector 29749456 op 0x1:(WRITE) flags 0x9800 phys_seg 10 prio class 2 
[11844.111492] device offline error, dev sdb, sector 60288416 op 0x1:(WRITE) flags 0x4000 phys_seg 3 prio class 2 
[11844.111504] Aborting journal on device sdb1-8. 
[11844.111509] device offline error, dev sdb, sector 60288656 op 0x1:(WRITE) flags 0x4000 phys_seg 2 prio class 2 
[11844.111522] Buffer I/O error on dev sdb1, logical block 3702784, lost sync page write 
[11844.111521] EXT4-fs error (device sdb1) in ext4_reserve_inode_write:5787: Journal has aborted 
[11844.111523] EXT4-fs error (device sdb1) in ext4_reserve_inode_write:5787: Journal has aborted 
[11844.111534] EXT4-fs error (device sdb1): ext4_convert_unwritten_extents:4849: inode #20: comm kworker/u16:2: mark_inode_dirty error 
[11844.111539] EXT4-fs error (device sdb1): ext4_dirty_inode:5991: inode #21: comm cp: mark_inode_dirty error 
[11844.111543] JBD2: I/O error when updating journal superblock for sdb1-8. 
[11844.111545] EXT4-fs error (device sdb1) in ext4_convert_unwritten_io_end_vec:4888: Journal has aborted 
[11844.111551] EXT4-fs error (device sdb1) in ext4_dirty_inode:5992: Journal has aborted 
[11844.111553] EXT4-fs (sdb1): failed to convert unwritten extents to written extents -- potential data loss!  (inode 20, error -30) 
[11844.111565] Buffer I/O error on device sdb1, logical block 7533568 
[11844.111574] Buffer I/O error on device sdb1, logical block 7533569 
[11844.111577] Buffer I/O error on device sdb1, logical block 7533570 
[11844.111580] Buffer I/O error on device sdb1, logical block 7533571 
[11844.111583] Buffer I/O error on device sdb1, logical block 7533572 
[11844.111586] Buffer I/O error on device sdb1, logical block 7533573 
[11844.111588] Buffer I/O error on device sdb1, logical block 7533574 
[11844.111591] Buffer I/O error on device sdb1, logical block 7533575 
[11844.111594] Buffer I/O error on device sdb1, logical block 7533576 
[11844.111596] Buffer I/O error on device sdb1, logical block 7533577 
[11844.111757] EXT4-fs error (device sdb1): ext4_journal_check_start:84: comm kworker/u16:1: Detected aborted journal 
[11844.111788] EXT4-fs warning (device sdb1): ext4_end_bio:343: I/O error 17 writing to inode 21 starting block 7536892) 
[11844.111807] Buffer I/O error on dev sdb1, logical block 0, lost sync page write 
[11844.111816] EXT4-fs (sdb1): I/O error while writing superblock 
[11844.111819] EXT4-fs (sdb1): Remounting filesystem read-only 
[11844.111822] EXT4-fs (sdb1): ext4_do_writepages: jbd2_start: 1024 pages, ino 13; err -30 
[11844.112653] Buffer I/O error on dev sdb1, logical block 0, lost sync page write 
[11844.112670] EXT4-fs (sdb1): I/O error while writing superblock 
[11845.532320] usb 2-2: new high-speed USB device number 3 using xhci_hcd 
[11845.659894] usb 2-2: New USB device found, idVendor=ffff, idProduct=5678, bcdDevice= 2.00 
[11845.659906] usb 2-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3 
[11845.659910] usb 2-2: Product: 䍆㈳㔶 
[11845.659913] usb 2-2: Manufacturer: 楆獲t档灩 
[11845.659917] usb 2-2: SerialNumber: 012345678901 
[11845.661033] usb-storage 2-2:1.0: USB Mass Storage device detected 
[11845.661414] scsi host7: usb-storage 2-2:1.0 
[11867.362604] usb 2-2: reset high-speed USB device number 3 using xhci_hcd
Comment 1 Artem S. Tashkinov 2024-09-22 18:58:25 UTC
> [11844.111565] Buffer I/O error on device sdb1, logical block 7533568 
> EXT4-fs (sdb1): I/O error while writing superblock 

Typically, such errors indicate a storage failure, not a filesystem problem.

I strongly suspect your media is broken or damaged and should not be used to store important information.

The easiest way to test it would be to use badblocks with a single pass, using the `-w     Use write-mode test` option.

The defaults for -b and -c are quite low, I'd suggest:

sudo badblocks -b 4096 -c 1000 -w -s -v /dev/sdX
Comment 2 Artem S. Tashkinov 2024-09-22 18:59:33 UTC
Note that this operation will destroy all data on the device you test, which in your case would be

`/dev/sdb`

Please triple check before running the command to avoid data loss.
Comment 3 nxe9 2024-09-23 01:35:52 UTC
>Typically, such errors indicate a storage failure, not a filesystem problem.

>I strongly suspect your media is broken or damaged and should not be used to
>store important information.

How can you explain the fact that I can copy tens of GB of data to the ntfs file system on different operating systems and no errors occur and data is always consistent? For me, this is a sign that something is wrong with ext4 since ntfs works without any problems on the same hardware.

I've already tested with badblocks and there were no errors.
badblocks -w -s -o error.log /dev/sdX
Comment 4 nxe9 2024-09-23 01:39:23 UTC
In short, in the case of ext4 I can generate an error very quickly. In the case of ntfs, I was unable to generate it even once.
Comment 5 Theodore Tso 2024-09-23 06:26:35 UTC
Ext4 uses a block allocation algorithm which spreads the blocks used by files across the entire storage device in order to reduce file fragmentation.   There are cheap thumb drives that claim to be, say, 16GB, but which only have 8GB of flash, and they rely on the fact that some Windows file systems (FAT and NTFS) allocate blocks starting at the low-numbered block numbers, and so if there is a fake/scammy USB thumb drive (the kind that you buy in the back alley of Shenzhen, or at a deep discount in the checkout line of Microcenter, or from a really dodgy vendor on Amazon Marketplace at a price which is too good to be true), it might work on Windows so long as you don't actually try to store that many files on it.

In any case, the console messages are very clearly I/O errors and the LBA sector number reported is a high-numbered address: 60278752.    Whether this is just a failed thumbdrive, or one which is deliberately sold as a fake is unclear, but I would suggest trying to read and write to all of the sectors of the disk.   Fundamentally, ext4 assumes that the storage device is valid; and if it is not valid (e.g., has I/O errors when you try to read or write to portions of the disk), that's the storage device's problem, not ext4.
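
To make "high-numbered" concrete, here is a small worked calculation (an editorial sketch; the constant and function names are made up for illustration). The sector size and sector count come from the `sd 6:0:0:0: [sdb]` line in the report, and the failing sector from the first "device offline error" line.

```python
SECTOR_SIZE = 512            # bytes per logical block, per the dmesg output
TOTAL_SECTORS = 61_440_000   # "[sdb] 61440000 512-byte logical blocks"
FAILED_SECTOR = 60_278_752   # sector in the first "device offline error" line


def sector_position(sector, total=TOTAL_SECTORS, sector_size=SECTOR_SIZE):
    """Return (byte offset, fraction of the device) for a sector number."""
    return sector * sector_size, sector / total


offset, fraction = sector_position(FAILED_SECTOR)
print(f"offset = {offset / 1e9:.2f} GB, {fraction:.1%} of the device")
# prints "offset = 30.86 GB, 98.1% of the device"
```

In other words, the failing write was aimed at the last 2% of the advertised 31.5 GB.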
Comment 6 Artem S. Tashkinov 2024-09-23 07:07:08 UTC
> and so if there is a fake/scammy USB thumb drive

AliExpress has hundreds of them.

Some are even sold as "2TB" drives when in reality you'll be lucky if they contain 16GB of disk space. Tons of reviews on YouTube as well.
Comment 7 nxe9 2024-09-23 16:24:59 UTC
Thank you for your entries. My pendrive is not a Chinese fake, and its stated size is correct. At least that's what I think. Intenso is a German company, although the chips are probably imported from the Far East.

Back to the topic...

I don't know much about file systems, so I'm relying on you. Is it likely that the file systems are so different that a hardware bug is visible regularly on one file system but is impossible to reproduce on the other? Besides, the fact is that two pendrives of the same model have the problem, and other models, even from the same manufacturer, do not. If I could see the error on ntfs just once, I wouldn't have a problem, but so far I haven't been able to reproduce the error on ntfs even once. Today I tested ntfs again with f3 and as usual no error. Apart from that I generated test data and filled the disk completely. As usual, all fully consistent on ntfs.

Freespace on ext4 according to f3write: Free space: 28.67 GB
Freespace on ntfs according to f3write: Free space: 29.23 GB

As you can see, I can write even more data to ntfs and it will not generate errors.

I will summarize some points:
- I/O errors in dmesg appear very rarely. During data corruption this error usually does not appear.
- f3 tests on ext4 fail only sometimes.
- when copying my own files to ext4 I can generate data inconsistency very quickly.
- badblocks doesn't show me any errors.
- ntfs always works great

Therefore, I am still interested in whether one file system can actually hide hardware defects (or is implemented in such a way that it is very difficult to reproduce) or maybe the other file system has some rare bug that will only become visible in the case of this hardware. For me it's not settled.
Comment 8 Artem S. Tashkinov 2024-09-23 16:30:03 UTC
2 billion Android users use ext4 daily with zero issues.

I/O errors must not appear EVER. I repeat: a normally working mass storage device should NEVER produce a single one of them.

In fact if I get a single IO error on any of my devices, it instantly gets wiped and thrown in the trash.

You can tell a FS that certain blocks are bad but if you value your sanity you should not be using such storage.

Please ask your question on either:

https://unix.stackexchange.com/questions or https://superuser.com/questions/

It does not belong here.
Comment 9 Theodore Tso 2024-09-23 18:53:02 UTC
It's not at all surprising that flaky hardware might have issues that are only exposed by some file systems and not others.   Different file systems can have very different I/O patterns: spatially (which blocks get used), temporally (how many I/O requests are issued in parallel, and how quickly), and in I/O request type (e.g., how many CACHE FLUSH requests are issued, if any, and how many Force Unit Access (FUA) writes, if any).

One quick thing I'd suggest that you try is to experiment with file systems other than ext4 and ntfs.  For example, what happens if you use xfs or btrfs or f2fs with your test programs?    If the hardware fails with xfs or btrfs, then that would very likely put the finger of blame on the hardware being cr*p.

The other thing that you can try is to run tests on the raw hardware.   For example, something like this [1] to write random data to the disk, and then verify the output.   The block device must be able to handle having random data written at high speeds, and when you read back the data, you must get back the same data that was written.   Unreasonable, I know, but if the storage device fails with random writes without a file system in the mix, it's going to be hopeless once you add a file system.

[1] https://github.com/axboe/fio/blob/master/examples/basic-verify.fio
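
For illustration only, the write-then-verify idea can also be sketched in a few lines of Python. This is not the fio job linked above and is no substitute for it. The helper names are hypothetical, `random.Random.randbytes` needs Python 3.9 or later, and `path` must already exist (a block device node or a pre-created file). Writing to a device node destroys its contents, and unless the page cache is dropped or the device is replugged between the two steps, the read may never reach the hardware.

```python
import os
import random


def _blocks(total_bytes, block_size, seed):
    """Yield a reproducible pseudo-random byte stream, one block at a time."""
    rng = random.Random(seed)
    remaining = total_bytes
    while remaining > 0:
        size = min(block_size, remaining)
        yield rng.randbytes(size)
        remaining -= size


def write_pattern(path, total_bytes, block_size=1 << 20, seed=1234):
    """Overwrite the first total_bytes of path with the seeded stream."""
    with open(path, "r+b") as fh:
        for block in _blocks(total_bytes, block_size, seed):
            fh.write(block)
        fh.flush()
        os.fsync(fh.fileno())  # ask the kernel to push the data to the device


def verify_pattern(path, total_bytes, block_size=1 << 20, seed=1234):
    """Regenerate the stream and return the indices of blocks that differ."""
    bad = []
    with open(path, "rb") as fh:
        for index, expected in enumerate(_blocks(total_bytes, block_size, seed)):
            if fh.read(len(expected)) != expected:
                bad.append(index)
    return bad
```

An empty list from `verify_pattern` means every block read back as written; any index in the list marks a block whose contents changed.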

I will note that large companies that buy millions of dollars of hardware, whether it's for data center use at hyperscaler cloud companies like Amazon or Microsoft, or flash devices used in mobile devices from Samsung, Motorola, Google (Pixel), etc., will spend an awful lot of time qualifying the hardware to make sure it is high quality before they buy it.  And they do this using raw tests against the block device, since this eliminates the excuse from the hardware company that "oh, this must be a file system bug".    If there are failures found when using storage tests against the raw block device, there is no place for the hardware vendor to hide.....

But in general, as Artem said, if there are any I/O failures at all, that's a huge red flag.   That essentially *proves* that the hardware is dodgy.   You can have dodgy hardware without I/O errors, but if there are I/O errors reading or writing to a valid block/sector number, then by definition the hardware is the problem.   And in your case, the errors are "USB disconnect" and "unit is off-line".   That should never, ever happen, and if it does, then there is a hardware problem.  It could be a cabling problem; it could be a problem with the SCSI/SATA/NVME/USB controller, etc., but the file system folks will tell you that if there are *any* such problems, resolve the hardware problem before asking the file system people to debug the problem.    It's much like asking a civil engineer why a building might have design issues when it's built on top of quicksand.  Buildings assume that they are built on stable ground.   If the ground is not stable, then choose a different building site or fix the ground first.
Comment 10 nxe9 2024-09-24 16:15:10 UTC
OK, thanks. You convinced me.

@Theodore Tso: Thank you for your detailed post. 

As I wrote in the first post, I tried f2fs once and it also broke the data. This confirms your claims.

I tried the "basic-verify.fio" example. Unfortunately, this method is not very practical, because in the case of my pendrive, the verification time is about 60 days. After 10 hours I stopped. The progress was less than one percent. Another properly functioning pendrive would also require many days. Perhaps this method would generate an error, but it is very cumbersome.

From the perspective of the average user, this is not a good situation, because you can operate on hardware that is not fully functional, not be fully aware of it and not have an easy and effective method to verify the status of your device. True, you can also buy hardware from a more reputable manufacturer.

Unfortunately, there's nothing I can do about it. Well, the only thing I can do is throw this equipment in the trash. Thank you again.
