Bug 111961 - ext4 umount - invalid opcode with PLEXTOR PX-128M6G-2242
Summary: ext4 umount - invalid opcode with PLEXTOR PX-128M6G-2242
Status: RESOLVED OBSOLETE
Alias: None
Product: File System
Classification: Unclassified
Component: ext4 (show other bugs)
Hardware: Intel Linux
: P1 high
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-02-05 16:16 UTC by Sebastian
Modified: 2017-03-08 10:48 UTC (History)
2 users (show)

See Also:
Kernel Version: 4.4.1
Subsystem:
Regression: No
Bisected commit-id:


Attachments
journald (14.29 KB, text/plain)
2016-02-05 16:16 UTC, Sebastian
Details
photo (3.26 MB, image/jpeg)
2016-02-05 16:19 UTC, Sebastian
Details
journalctl | grep ext4 (13.76 KB, application/octet-stream)
2016-02-05 23:11 UTC, Sebastian
Details
journalctl | grep ext4 (5.51 KB, application/octet-stream)
2016-02-05 23:11 UTC, Sebastian
Details
journald output (4.62 MB, application/octet-stream)
2016-02-07 09:15 UTC, Sebastian
Details

Description Sebastian 2016-02-05 16:16:28 UTC
Created attachment 202951 [details]
journald

On shutdown system hangs and prints stack trace. After reboot I had to recover my home partition with fsck, sector count was corrupted (didn't check on data loss yet). I've attached journald log + hdd info.

My partitions are mounted with defaults + noatime flags. The ssd is tied to an acer chromebook 11 C740 running arch linux (no Chromeos).

If you need any more info don't hesitate to ask.

best regards


Sebastian

---
hdparm info:

/dev/sda:

ATA device, with non-removable media
	Model Number:       PLEXTOR PX-128M6G-2242                  
	Serial Number:      P02539111451        
	Firmware Revision:  1.01    
	Transport:          Serial, ATA8-AST, SATA II Extensions, SATA Rev 2.6, SATA Rev 3.0
Standards:
	Used: ATA/ATAPI-7 T13 1532D revision 4a 
	Supported: 8 7 6 5 & some of 8
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  250069680
	LBA48  user addressable sectors:  250069680
	Logical  Sector size:                   512 bytes
	Physical Sector size:                   512 bytes
	Logical Sector-0 offset:                  0 bytes
	device size with M = 1024*1024:      122104 MBytes
	device size with M = 1000*1000:      128035 MBytes (128 GB)
	cache/buffer size  = unknown
	Nominal Media Rotation Rate: Solid State Device
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 1
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	DOWNLOAD_MICROCODE
	    	SET_MAX security extension
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test
	   *	General Purpose Logging feature set
	   *	64-bit World wide name
	   *	Segmented DOWNLOAD_MICROCODE
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Gen2 signaling speed (3.0Gb/s)
	   *	Gen3 signaling speed (6.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Host-initiated interface power management
	   *	Phy event counters
	   *	DMA Setup Auto-Activate optimization
	   *	Device-initiated interface power management
	   *	Software settings preservation
	   *	Device Sleep (DEVSLP)
	   *	SMART Command Transport (SCT) feature set
	   *	SCT Write Same (AC2)
	   *	SCT Error Recovery Control (AC3)
	   *	SCT Features Control (AC4)
	   *	SCT Data Tables (AC5)
	   *	Data Set Management TRIM supported (limit 8 blocks)
Security: 
		supported
	not	enabled
	not	locked
	not	frozen
	not	expired: security count
		supported: enhanced erase
	6min for SECURITY ERASE UNIT. 6min for ENHANCED SECURITY ERASE UNIT. 
Logical Unit WWN Device Identifier: 500230310063e763
	NAA		: 5
	IEEE OUI	: 002303
	Unique ID	: 10063e763
Device Sleep:
	DEVSLP Exit Timeout (DETO): 100 ms (drive)
	Minimum DEVSLP Assertion Time (MDAT): 10 ms (drive)
Checksum: correct
Comment 1 Sebastian 2016-02-05 16:19:59 UTC
Created attachment 202961 [details]
photo
Comment 2 Theodore Tso 2016-02-05 21:02:00 UTC
Can you replicate this problem?   Or can you take a look for any messages starting with EXT4-fs from the kernel (look in /var/log/messages and/or /var/log/dmesg and/or /var/log/syslog) starting from the time of the system booting up to the kernel BUG_ON.

The reason for the kernel BUG report was the fact that there was a left over inode on the orphan inode list.   For some reason inode 789844 was left on the orphan list, and worse, it has a negative link count:

Feb 05 16:22:42 frodo kernel: EXT4-fs (sda3): Inode 789844 (ffff880071df2c00): orphan list check failed!

Feb 05 16:22:42 frodo kernel: EXT4-fs (sda3): sb orphan head is 789844
Feb 05 16:22:42 frodo kernel: sb_info orphan list:
Feb 05 16:22:42 frodo kernel:   inode sda3:789844 at ffff880071df2cc0: mode 100600, nlink -1, next 0

Unfortunately the logs you have included don't shed any light as to how inode 789844 got into this state.   When the file system was unmounted, it tripped over the dead body (as it were), but the question is when and how was the person killed in the first place (e.g., how did the link count and orphan list get into this state).    Also of interest would be if you have saved the output from fsck, so we can see what it found when it repaired the file system.   It's possible it was just the in-memory link count which got corrupted, or the on-disk link count could have been messed up as well; looking at the fsck output would help determine that.

Thanks!!
Comment 3 Sebastian 2016-02-05 23:09:52 UTC
Thank you very much for the fast response! dmesg output from current boot doesn't include any ext4 message. I don't have a syslog daemon running, but I grepped for fsck and ext4 on journald, hoping that helps.

The problem just occurred once - after manual recovering from systemd recovery shell the system boots fine again . I'll check on the logs and will provide more information in case anything fails again. If you miss something or if I can help you there get back to me anytime.
Comment 4 Sebastian 2016-02-05 23:11:11 UTC
Created attachment 203001 [details]
journalctl | grep ext4
Comment 5 Sebastian 2016-02-05 23:11:44 UTC
Created attachment 203011 [details]
journalctl | grep ext4
Comment 6 Theodore Tso 2016-02-07 03:50:54 UTC
It would be interesting to get the full output from journalctl for the time period between Feb 5 11:00 and 21:00 --- but it looks like the root cause was file system corruption, perhaps caused by a hardware problem.   From around 11:14, it looks like an inode table block got zeroed:

Feb 05 11:14:13 frodo kernel: EXT4-fs error (device sda3): ext4_lookup:1584: inode #790461: comm BrowserBlocking: deleted inode referenced: 789864
Feb 05 11:14:13 frodo kernel: EXT4-fs error (device sda3): ext4_lookup:1584: inode #790461: comm BrowserBlocking: deleted inode referenced: 789850
Feb 05 11:14:13 frodo kernel: EXT4-fs error (device sda3): ext4_lookup:1584: inode #790461: comm BrowserBlocking: deleted inode referenced: 789851
Feb 05 11:14:13 frodo kernel: EXT4-fs error (device sda3): ext4_lookup:1584: inode #790461: comm BrowserBlocking: deleted inode referenced: 789855

It also looks like there was some kind of warning or some other kernel issue that triggered a kernel warning, but since I only have the output of "journalctl | grep ext4", we're not seeing the full stack trace or other kernel warning information:

Feb 05 15:51:25 frodo kernel:  [<ffffffffa0197349>] ext4_rename+0x4c9/0x8a0 [ext4]
Feb 05 15:51:25 frodo kernel:  [<ffffffffa019773d>] ext4_rename2+0x1d/0x30 [ext4]


And this was *before* the kernel BUG  reported at 16:22:

Feb 05 16:22:42 frodo kernel: kernel BUG at fs/ext4/super.c:846!

More concerning is the fact that the file system was fixed after you ran e2fsck manually, and then it was reported clean:

Feb 05 16:30:38 frodo systemd-fsck[197]: /dev/sda3 contains a file system with errors, check forced.


Feb 05 16:33:29 frodo systemd-fsck[204]: /dev/sda3: clean, 102074/5718016 files, 4762060/22869846 blocks

It's now marked clean, so you must have run e2fsck between 16:30 and 16:33.

But then at 20:53 the kernel had found more file system corruption, and it looks like it was another block that got zeroed out:

Feb 05 20:53:39 frodo kernel: EXT4-fs error (device sda3): ext4_find_dest_de:1809: inode #5513136: block 22029326: comm scp: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0

No one else has reported problems where file system blocks are getting zero'ed out, so the primary suspect that I would have is a hardware problem.   So before you do anything else, I'd strongly suggest that you make full backups and that you not store anything precious on the flash device until it's been resolved (but I suspect it might require replacing the flash device).
Comment 7 Sebastian 2016-02-07 09:14:25 UTC
Thanks for the warning, but thankfully I've got no important data on that device and everything backed up. After the issue occurred I recovered my home partition manually from systemd recovery shell. I've disabled acpi for the SATA controller - maybe that's causing the problem. I'll keep an eye on the logs, though. I attach the full journal.

thank you very much for looking into this!
Comment 8 Sebastian 2016-02-07 09:15:22 UTC
Created attachment 203071 [details]
journald output
Comment 9 Sebastian 2016-02-10 17:33:04 UTC
Hi,

just wanted to let you know that I didn't face any error related to ext4 after disabling acpi for the sata controller.

This was my change (custom powertop settings runner started by systemd):

# Enable SATA link power management for host0   
#echo 'min_power' > '/sys/class/scsi_host/host0/link_power_management_policy';
# Enable SATA link power management for host1   
#echo 'min_power' > '/sys/class/scsi_host/host1/link_power_management_policy';


looks fine so far.


best regards


Sebastian
Comment 10 Szőgyényi Gábor 2017-03-06 20:29:04 UTC
Please try to reproduce this bug with latest kernel image.
Comment 11 Sebastian 2017-03-08 10:48:34 UTC
Hi,

I can't as I replaced the SSD in the meantime, therefore I mark this issue as obsolete. Thank you anyways for looking into this.

best regards

Sebastian

Note You need to log in before you can comment on or make changes to this bug.