Bug 71301

Summary: CPU soft lockup
Product: File System
Component: ext4
Reporter: Cyril Adrian (cyril.adrian)
Assignee: fs_ext4 (fs_ext4)
Status: RESOLVED INVALID
Severity: normal
Priority: P1
CC: cyril.adrian, tytso
Hardware: x86-64
OS: Linux
Kernel Version: 3.13.5
Subsystem:
Regression: No
Bisected commit-id:
Attachments: .config and syslog

Description Cyril Adrian 2014-02-28 16:28:30 UTC
Created attachment 127701
.config and syslog

During a simple and innocent "du" :-)

The process locked up in the kernel and I could not kill it; the machine then froze at shutdown (I had to power it off the hard way).

I am attaching the syslog (starting from the first error only).

The kernel is home-compiled; its .config is attached too.

For more info please ask.

Best regards,

Cyril
Comment 1 Theodore Tso 2014-02-28 20:42:44 UTC
It looks like a wild pointer corrupted kernel memory.  This can be seen from the garbage device name in the ext4_error message:

Feb 26 20:11:18 didactylos kernel: [24848.293159] EXT4-fs error (device >fA
,\rE): ext4_map_blocks:582: inode #64881109: block 259531249: comm du: lblock 0 mapped to illegal pblock (length 1)

Also, the BUG_ON at timer.c:732 is triggered because the timer function sb->s_err_report.function is zero.  But that field is initialized when the file system is mounted, and the ext4 code never changes it afterwards.

So the bug could be anywhere in the kernel; this time around, the wild pointer just happened to corrupt memory that belongs to ext4's data structures.
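
The garbage device name is the kind of symptom a small script can flag when scanning a syslog. The following is only a rough sketch: the function name suspicious_devices and the whitelist of "plausible" device-name characters are assumptions made for illustration, not anything ext4 enforces, and a clean device name does not prove memory is intact.

```python
import re
import string

# Characters assumed plausible in a block device name (e.g. "sda1", "dm-0").
# This whitelist is an illustrative assumption, not a rule enforced by ext4.
PLAUSIBLE = set(string.ascii_letters + string.digits + "-_.:/!")

# Matches "EXT4-fs error (device NAME): FUNCTION:LINE:". DOTALL lets NAME
# contain stray control characters, such as a newline inside a corrupted name.
EXT4_ERROR = re.compile(r"EXT4-fs error \(device (.*?)\): (\w+):(\d+):", re.DOTALL)


def suspicious_devices(log_text):
    """Return (device, function) pairs whose device name looks corrupted."""
    hits = []
    for match in EXT4_ERROR.finditer(log_text):
        device, function = match.group(1), match.group(2)
        # An empty name or any character outside the whitelist is flagged.
        if not device or any(ch not in PLAUSIBLE for ch in device):
            hits.append((device, function))
    return hits
```

Run against the attached syslog, a check like this would flag the ext4_map_blocks message quoted above, while an ordinary error on a device such as sda1 would pass through unflagged.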

Are you seeing any other problems with your kernel?  Can you reliably reproduce any of them?
Comment 2 Cyril Adrian 2014-03-01 06:53:18 UTC
Nothing reliable, but a lot of memory retention (3.13 manages to fill up my 16 GB of RAM, e.g. when I do a remote backup using rsync; something I had never seen before) -- maybe worsened by suspend, but I'm not sure of anything.

Anyway, it seems related to 3.13, because I never had any problems with 3.12.

How can I investigate?

Thanks,

Cyril
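
On the memory question, one rough starting point is to check how much of the "full" RAM is reclaimable cache rather than memory that is really held. The sketch below is illustrative only and was not suggested in this thread: the helper names parse_meminfo and memory_breakdown are hypothetical, and treating Buffers + Cached + SReclaimable as reclaimable is a simplifying assumption (Cached also counts shared memory, which is not freely reclaimable).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of numeric values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only "Name: <number> [unit]" lines; skip anything malformed.
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_breakdown(info):
    """Split MemTotal into free, roughly reclaimable, and everything else."""
    # Assumption: buffers, page cache and reclaimable slab can be dropped
    # by the kernel under memory pressure.
    reclaimable = (
        info.get("Buffers", 0)
        + info.get("Cached", 0)
        + info.get("SReclaimable", 0)
    )
    free = info.get("MemFree", 0)
    other = info["MemTotal"] - free - reclaimable
    return {"free": free, "reclaimable": reclaimable, "other": other}
```

Feeding it the contents of /proc/meminfo before and after an rsync run shows whether the growth lands in "reclaimable" (typically page cache) or in "other", which would be more consistent with real retention.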
Comment 3 Cyril Adrian 2014-03-02 13:55:17 UTC
OK, memtest86+ found the problem: faulty RAM. Bug fixed :-)

Best regards,

Cyril