Bug 114821 - Frequent and recurring ext4 "bad header invalid magic" errors on a healthy drive
Summary: Frequent and recurring ext4 "bad header invalid magic" errors on a healthy drive
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: ARM Linux
Importance: P1 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-03-17 05:05 UTC by Lukas
Modified: 2016-03-20 20:50 UTC
4 users

See Also:
Kernel Version: 3.18.5+
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg (22.38 KB, application/octet-stream)
2016-03-18 05:17 UTC, Lukas
syslog (4.93 KB, application/octet-stream)
2016-03-18 05:18 UTC, Lukas

Description Lukas 2016-03-17 05:05:04 UTC
Recurring ext4 "bad header" errors are reported for a healthy, few-months-old external USB drive (WD MyBook 1230) connected to a Raspberry Pi running Raspbian 7 with kernel 3.18.5+. The drive previously worked fine when formatted as NTFS, SMART has no errors logged, and the drive passes the short and long SMART self-tests as well as the manufacturer's diagnostic tests (WD Data Lifeguard). Another external USB drive (a WD MyBook 1140), formatted as NTFS, shows no problems when connected to the same machine. The specific error is:

ext4_ext_check_inode:495: inode #1523: comm ls: pblk 0 bad header/extent: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)

The errors are produced whenever certain files are accessed, and the hard drive makes a distinct (and rather unsettling) clunk every few seconds, which sounds like a head load/unload, yet the files seem to be read without error.

Running fsck reports and fixes some errors and makes the ext4 partition clean again, but the same errors are produced when the files are read again. Moving the offending files to another device and back to the same location stops the error from recurring for those files, but it soon appears again when accessing other files.
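
For reference, the check I run is roughly the following (with the partition unmounted first):

$ sudo umount /dev/sdb1
$ sudo fsck.ext4 -f /dev/sdb1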
Comment 1 Christian Kujau 2016-03-17 07:38:19 UTC
"the hard drive produces a distinct clunk every few seconds" sounds like hardware problems. Other file systems have different access patterns and it may not be vis^W audible with e.g. NTFS, but I wouldn't trust a hard disk that makes "clunk" noises. That being said, there are crucial parts missing from this report: syslog errors during the access problems (a full dmesg may be helpful too); does it happen with a current vanilla kernel? Also: has this been reported to the linux-ext4 mailing list prior opening this bug?

Also, this error seems to be rather common in Raspberry Pi environments, see:

https://www.linuxquestions.org/questions/slackware-arm-108/raspberry-pi2-keeps-crashing-and-corrupting-disk-4175541613/  (May 2015)

https://github.com/atom/atom/issues/8381 (Aug 2015)

https://unix.stackexchange.com/questions/269609/recurring-ext4-bad-header-invalid-magic-errors-on-a-healty-drive (from 3 days ago, is that you?)
Comment 2 Lukas 2016-03-18 05:17:17 UTC
I suspected hardware problems initially, but now I don't think that's the case - both the short and long SMART self-tests as well as the vendor diagnostic tests (including a 10-hour scan of every sector) produced no errors. All SMART attributes that would indicate electromechanical failure are at zero raw values.

I've also just verified that the 'clunk' sound is a head load cycle. By copying a certain file I can trigger the error, and if I check the SMART attributes before and after, I see that 'Load Cycle Count (ID #193)' has increased by 2. Currently it is at 1823.
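
In case it's useful, the check looks roughly like this (the file path is a placeholder for one of the affected files; smartctl may need a '-d' option to talk through the USB bridge):

$ sudo smartctl -A /dev/sdb | grep -i load_cycle
$ cp /mnt/mybook/affected-file.txt /tmp/
$ sudo smartctl -A /dev/sdb | grep -i load_cycle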

Unfortunately, I'm not skilled enough to quickly test with a vanilla kernel on the same OS. Would it be useful if I tested using Raspbian Jessie with a 4.1 kernel instead? I can swap the SD card with the OS while keeping everything else the same.

The stackexchange post is indeed mine, and I have not posted this to the linux-ext4 mailing list. Should I do so? I apologize if I'm doing things out of order here; I have little experience with official Linux bug reporting.

Attaching dmesg and syslog files.
Comment 3 Lukas 2016-03-18 05:17:52 UTC
Created attachment 209701 [details]
dmesg
Comment 4 Lukas 2016-03-18 05:18:18 UTC
Created attachment 209711 [details]
syslog
Comment 5 Theodore Tso 2016-03-18 22:46:27 UTC
No file system activity should be causing a head load cycle in the absence of a hardware problem.

Any chance you can try copying the file system image to another external USB drive, run e2fsck on the copied file system image, and then try using the copied file system image on the new USB drive?
Comment 6 Lukas 2016-03-20 06:46:35 UTC
I agree, the head cycling does seem unlikely to be caused by software. But I'm confused how the drive can be 'failing' simple reads so often and yet pass every fitness test, including a vendor diagnostic scan of every sector...

I cannot create a drive image... The drive is 4 TB in size and I don't have that much contiguous free storage at my disposal. What I could do is wipe the drive and re-create the ext4 partition; I have most of the contents backed up. It sounds like a radical way of testing, but let me know if that's my only option.

In the meantime I've done some more testing. Maybe it will prove useful.

Firstly, I've reproduced the error on Raspbian 8 with a 4.1.18-v7+ kernel running on a different Raspberry Pi device. (Sorry, the kernel is again non-vanilla.) Upon accessing a particular file on the drive, I can hear the head load cycle and a dmesg ext4 error is logged.

Secondly, I've noticed that most of the ext4 errors come from smbd and refer to the same inode, #1523. That inode seems to be accessed quite often whenever I do anything with the drive.

I ran

$ debugfs -R "ncheck 1523" /dev/sdb1

and determined that the inode corresponds to a single text file in the root directory of the drive - 'file.txt'. I cannot list or delete this file:

$ ls -li

ls: cannot access file.txt: Input/output error
? -?????????  ? ?     ?               ?            ? file.txt

$ rm file.txt
rm: cannot remove `file.txt': Input/output error

If I execute

$ debugfs -R 'cat <1523>' /dev/sdb1

it returns nothing, and I believe the file should indeed be empty. The command takes a few minutes to execute, which I don't think is normal either.

I know this sounds even more like hardware problems, but, again, there's nothing in the SMART attributes, self-tests or vendor diagnostic tests to support that.

Should I attempt to delete this file using debugfs? Or should I re-create the partition from scratch? Or should I just get another drive, close this 'bug' and stop troubleshooting?
Comment 7 Andreas Dilger 2016-03-20 15:12:38 UTC
Probably better than deleting the file, which would just release the bad block(s) to be allocated by some other file, is to use "debugfs -c -R 'stat <1523>' /dev/sdb1" to list the blocks allocated to the file, and then run "badblocks" to mark those blocks bad.  This might also be the inode block that is bad, which you could tell by listening for the head jump when you run "debugfs stat" on the inode.  I don't think ext4 or e2fsck currently handle bad blocks in the middle of the inode table, though I guess they could by permanently marking those inodes allocated when parsing the bad block numbers.
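
(Roughly, as a sketch - the block number below is a placeholder for whatever 'stat' reports, the block size assumes a 4k filesystem, and the filesystem should be unmounted for the e2fsck step:)

# re-test the suspect block (arguments are last-block first-block, in filesystem block units)
$ badblocks -b 4096 -sv /dev/sdb1 123456789 123456789
# then add it to the filesystem's bad-block list so it won't be reallocated
$ echo 123456789 > bad-blocks.txt
$ e2fsck -f -l bad-blocks.txt /dev/sdb1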

One way to check which block(s) are bad, if it isn't the inode itself: once you have isolated the file's block number(s), run "dd if=/dev/sdb1 of=/dev/null bs=4k count=1 skip={blocknr}" one at a time for each block and listen for the head jump.  This may be complicated by readahead in both the drive and the kernel, so it may be better to do this from higher to lower block numbers.
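
(For example, going from higher to lower block numbers - the block numbers here are only placeholders, and iflag=direct, where supported, helps take the kernel's readahead out of the picture:)

$ for blk in 123456792 123456791 123456790 123456789; do echo "block $blk"; dd if=/dev/sdb1 of=/dev/null bs=4k count=1 skip=$blk iflag=direct; done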
Comment 8 Lukas 2016-03-20 20:50:12 UTC
The command

$ debugfs -c -R 'stat <1523>' /dev/sdb1

produces the following output:

Inode: 1523   Type: regular    Mode:  0644   Flags: 0x80000
Generation: 0    Version: 0x00000000
User:  1001   Group:   100   Size: 0
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x56b66f93 -- Sat Feb  6 17:11:31 2016
atime: 0x56b66f93 -- Sat Feb  6 17:11:31 2016
mtime: 0x4e4d75ac -- Thu Aug 18 16:27:24 2011
Size of extra inode fields: 0
EXTENTS:
(END)

Executing 'stat <1523>' command does not cause a head load cycle.

As I understand it, there are no blocks assigned to the inode. The file is an empty text file, so maybe that makes sense? I've created a new empty text file and it also has no blocks assigned to its inode, but it can be read and stat'ed with no problems.

I've checked a few of the other inode numbers that appear in dmesg less frequently, and all of them point to empty text files! I would guess the reason inode #1523 appears so often in the logs is that the file is in the root directory and gets listed each time the root is traversed.
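
(For anyone wanting to repeat this, something along these lines maps the logged inode numbers to file names - the exact grep pattern may need adjusting to the log format:)

$ for ino in $(dmesg | grep -o 'inode #[0-9]*' | cut -d'#' -f2 | sort -un); do debugfs -R "ncheck $ino" /dev/sdb1; done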

So, is it possible that something went wrong with just the zero-length files while I was copying my data to the drive? I did use ext2fsd on Windows 8.1 to copy some of the files... I just tested it again, and I can copy zero-length files to the drive using Windows and read them when mounted on Linux with no problems.

And since the zero-length files contain no useful information (other than their file names, which I can still read) and have no blocks assigned to them, can I now just remove them using debugfs?
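
(I assume that would be something along the lines of the following, run with the filesystem unmounted, though I haven't tried it yet:)

$ umount /dev/sdb1
$ debugfs -w -R 'rm /file.txt' /dev/sdb1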
