Created attachment 72287: screenshot of invalid opcode BUG during hibernation

Let's open a separate issue for this; maybe it's not the same as bug 13375 after all. Occasionally, during the memory preallocation phase of hibernation, I hit an invalid opcode bug in radix_tree_tag_set, called from __xfs_inode_set_reclaim_tag (see the attached screenshot for the full stack trace). The bug isn't easily reproducible, but the two captured stack traces are very similar (you can find the other in attachment 72194).
I hit it again under 3.2.4 during normal usage (not during hibernation or anything esoteric). This also means I can present a textual backtrace now:

> ------------[ cut here ]------------
> kernel BUG at lib/radix-tree.c:477!
> invalid opcode: 0000 [#1]
>
> Pid: 19, comm: kswapd0 Not tainted 3.2.4 #2 IBM 1834S5G/1834S5G
> EIP: 0060:[<c1195015>] EFLAGS: 00010246 CPU: 0
> EIP is at radix_tree_tag_set+0xa7/0xaf
> EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000001
> ESI: 00000007 EDI: de6a1a90 EBP: de9bbdcc ESP: de9bbdb0
> DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
> Process kswapd0 (pid: 19, ti=de9ba000 task=de877400 task.ti=de9ba000)
> Stack:
> 00000000 ddfa8bc0 00000087 de6a1a90 c5712900 ddfa8b80 00000013 de9bbdfc
> c112fd92 c1166c0c 00000000 00080000 00000000 ddfa8b80 00000000 ddf53800
> c5712900 ddfa8b80 00000000 de9bbe10 c112fe70 c5712a04 c5712900 c5712a04
> Call Trace:
> [<c112fd92>] __xfs_inode_set_reclaim_tag+0x47/0xee
> [<c1166c0c>] ? xfs_perag_get+0x1e/0x81
> [<c112fe70>] xfs_inode_set_reclaim_tag+0x37/0x4b
> [<c112e27e>] xfs_fs_destroy_inode+0x73/0x96
> [<c10c03dd>] destroy_inode+0x2b/0x45
> [<c10c06f9>] evict+0xb1/0x116
> [<c10c0a78>] dispose_list+0x2b/0x35
> [<c10c0b69>] prune_icache_sb+0xe7/0x1f7
> [<c10b0667>] prune_super+0xf1/0x13c
> [<c108f8f2>] shrink_slab+0x19a/0x2be
> [<c1090642>] kswapd+0x6d0/0x90f
> [<c10412cc>] ? wake_up_bit+0x67/0x67
> [<c108ff72>] ? shrink_zone+0x4de/0x4de
> [<c1040f66>] kthread+0x6c/0x6e
> [<c1040efa>] ? kthreadd+0xad/0xad
> [<c1423436>] kernel_thread_helper+0x6/0x10
> Code: e8 8b 51 04 8b 4d e4 83 c1 18 bb 01 00 00 00 d3 e3 85 d3 75 08 09 d3 8b
> 7d e8 89 5f 04 83 c4 10 5b 5e 5f 5d c3 85 c0 74 f4 eb d3 <0f> 0b eb fe 0f 0b
> eb fe 55 89 e5 57 56 53 83 ec 44 89 c7 89 55
> EIP: [<c1195015>] radix_tree_tag_set+0xa7/0xaf SS:ESP 0068:de9bbdb0
> ---[ end trace 6e1e4ef9d3733d9d ]---
It has tripped a radix-tree internal BUG while setting a tag on the tree, which indicates that the index being tagged is not actually in the tree. Which, AFAICT, shouldn't be possible. There are two possible radix trees that are tagged in this function.

How reproducible is it? If you can reproduce it, getting a trace of the xfs_destroy_inode, xfs_perag_get, xfs_perag_set_reclaim and xfs_iget* trace points might help show what inode or AG index it was that died and whether it is actually in the radix tree or not...

Dave.
I can't reproduce it reliably, but it seems related to memory pressure (and the stack trace seems to agree with this to my naive eyes: shrink_*, prune_*, dispose_list, evict, xfs_fs_destroy_inode and xfs_inode_set_reclaim_tag all point to memory reclaim being in progress). No wonder it mostly happens during hibernation, more precisely during memory preallocation, when half of the memory must be freed for the hibernation image. And last time it happened while starting Firefox, which occupies half of the memory of this machine. If you can suggest a good way to simulate such memory pressure, I'll probably have a better chance of reproducing this in a controlled environment.

Regarding tracing, I'm not particularly handy with it, so please provide some details about which tracer to enable, how large the buffers would need to be, etc. I played with it before, but got mostly confused by the overlapping functionality. Also, I'm running a self-compiled kernel, so adding some instrumentation would be easy and possibly more efficient.

Thanks for your time!

Feri.
For tracing, download the trace-cmd program and use it to record and report the events:

$ trace-cmd record -e <events>
....
^C
$ trace-cmd report > events.txt

As to the problem being memory reclaim related - that's not necessarily the case. That's when the problem is tripped over, but it doesn't mean that this is the cause. I'm starting to think that you might have dodgy memory or some kind of memory corruption occurring, because we haven't changed anything in the code where the crash is occurring for several releases...

Dave.
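Filling in the events placeholder of the trace-cmd recipe with the trace points named earlier in this thread might look like the sketch below. It is only an illustration, not a tested recipe: the helper name xfs_trace_cmdline and the XFS_EVENTS variable are made up here, and the xfs_iget* pattern assumes that trace-cmd accepts glob expressions for event names, as its man page describes. The script only prints the command line, so nothing is traced until the printed command is run as root.

```shell
#!/bin/sh
# Trace points named earlier in this thread. The xfs_iget* entry assumes
# trace-cmd matches event names as glob expressions.
XFS_EVENTS='xfs_destroy_inode xfs_perag_get xfs_perag_set_reclaim xfs_iget*'

# Hypothetical helper: print a "trace-cmd record" command line with one
# -e option per event. It prints instead of running, so no root is needed.
xfs_trace_cmdline() {
    set -f    # stop the shell from expanding xfs_iget* against local files
    cmd='trace-cmd record'
    for ev in $XFS_EVENTS; do
        cmd="$cmd -e '$ev'"
    done
    set +f
    printf '%s\n' "$cmd"
}

xfs_trace_cmdline
```

Run the printed command as root, wait for the problem to show up (or stop the recording with ^C), and then produce the text report with trace-cmd report > events.txt as described above.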
Thanks for the trace-cmd tip, I'll give that a shot. Compilation under 32-bit gave some format-string mismatch warnings, but hopefully they aren't serious. On the other hand, 5 passes in Memtest-86 gave no errors, so memory problems seem improbable.

Feri.