Bug 71301 - CPU soft lockup
Summary: CPU soft lockup
Status: RESOLVED INVALID
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-02-28 16:28 UTC by Cyril Adrian
Modified: 2014-03-02 13:55 UTC

See Also:
Kernel Version: 3.13.5
Subsystem:
Regression: No
Bisected commit-id:


Attachments
.config and syslog (47.72 KB, application/x-xz)
2014-02-28 16:28 UTC, Cyril Adrian

Description Cyril Adrian 2014-02-28 16:28:30 UTC
Created attachment 127701
.config and syslog

This happened during a simple and innocent "du" :-)

The process locked up in the kernel and I could not kill it. The machine then froze at shutdown (I had to turn it off the hard way).

I am attaching the syslog (starting only from the first bug).

The kernel is home-compiled; the .config is attached too.

For more info please ask.

Best regards,

Cyril
Comment 1 Theodore Tso 2014-02-28 20:42:44 UTC
It looks like a wild pointer corrupted kernel memory.  This can be seen in the device name being garbage in the ext4_error message:

Feb 26 20:11:18 didactylos kernel: [24848.293159] EXT4-fs error (device >fA
,\rE): ext4_map_blocks:582: inode #64881109: block 259531249: comm du: lblock 0 mapped to illegal pblock (length 1)

Also, the BUG_ON at timer.c:732 is triggered because the timer function sb->s_err_report.function is zero.  But this is initialized when the file system is mounted, and the ext4 code never changes it afterwards.

So the bug could be anywhere in the kernel; the memory that the wild pointer happened to corrupt this time around was simply in ext4's data structures.
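
For anyone who wants to check their own logs for the same symptom, here is a minimal Python sketch that flags EXT4-fs error lines whose device name looks like garbage. It is only an illustration: the function name suspicious_ext4_devices and the "plausible device name" pattern are assumptions made for this example, not anything taken from the kernel or from this report, and a flagged line only hints at memory corruption rather than proving it.

```python
import re

# Assumption: real block device names in ext4 messages look like "sda1",
# "dm-0", "md0" or "nvme0n1p2". This pattern is illustrative, not exhaustive.
PLAUSIBLE_DEVICE = re.compile(r"[A-Za-z][A-Za-z0-9_\-]*")
EXT4_PREFIX = "EXT4-fs error (device "


def suspicious_ext4_devices(lines):
    """Yield (line_number, raw_device) for ext4 errors with odd device names."""
    for number, line in enumerate(lines, start=1):
        start = line.find(EXT4_PREFIX)
        if start < 0:
            continue
        rest = line[start + len(EXT4_PREFIX):].rstrip("\n")
        end = rest.find("):")
        # A missing "):" means the name ran past the end of the log line,
        # for example because the garbage contained a line break.
        device = rest if end < 0 else rest[:end]
        if end < 0 or not PLAUSIBLE_DEVICE.fullmatch(device):
            yield number, device
```

It could be run over a saved log with something like `for n, dev in suspicious_ext4_devices(open("syslog", errors="replace")): print(n, repr(dev))`, where "syslog" is whatever file the messages were saved to.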

Are you seeing any other problems with your kernel?  Can you reliably reproduce any of them?
Comment 2 Cyril Adrian 2014-03-01 06:53:18 UTC
Nothing reliable, but a lot of memory retention (3.13 manages to fill up my 16 GB of RAM, e.g. when I do a remote backup using rsync, something I had never seen before). It may be worsened by suspend, but I'm not sure about anything.

Anyway, it seems related to 3.13 because I never had any problems with 3.12.

How can I investigate?

Thanks,

Cyril
Comment 3 Cyril Adrian 2014-03-02 13:55:17 UTC
OK, memtest86+ found the problem: faulty RAM. Bug fixed :-)

Best regards,

Cyril
