This is my final attempt at offering help and seeking help after 6 months of trying. I am just guessing at Product/Component; please modify as needed.

Brief: On a buggy kernel, most nights between about 5:25 and 6:30am, roughly 20-50 oom-killer invocations get logged, usually followed by a system hang (no numlock light toggle, no alt-sysrq). Rarely, the system is not totally dead and I can recover by rapidly killing things and issuing "reboot". Rarely, it has shown the beginnings of the behaviour while I am working during normal hours.

3.11.10-200.fc19.i686.PAE does not have this bug. Every kernel I have tried after that version, up to today's Fedora 3.14.17, has the bug. If I boot into any >3.11 kernel and wait 1 or at most 2 nights, the bug hits. 3.11 *NEVER* has this bug, ever, not once. I run 3.11 most of the time now, only booting into new 3.14.x's to see if anything has been fixed.

After scouring the web for any hint of similar bugs, and watching replies to my RHBZ, I have a major hunch it is PAE related. This would explain why almost no one else sees it. It may also be md RAID related. There were some other PAE RAID bugs (slab?) fixed on the LKML that seemed extremely relevant, and the people fixing them noted there were probably more bugs of this nature.

The bug also seems possibly related to IO. It nearly always hits during a nightly cron job of mine that does a "find / -xdev > /mnt/some-nfs-server/file-list". That job usually does not finish. If I run that job interactively during the day, the bug does not trigger! But run at night via cron, it seems to trigger it. Other responders have told me they get hit "during backups".

I can better describe how the system feels and behaves when the bug hits, if required, for the few times it happened while I was present.

I have 8GB RAM and 8GB swap on an Intel X75 chipset (ECC). When the bug hits, I sometimes do not have much running, usually only 1-2GB RAM used by processes. Swap is almost always 99% empty.
I have tried sysctl'ing the system to swap more aggressively but it does not help at all. The system is lightly to modestly loaded. It is a GUI workstation that also runs quite a few daemons.

For the gory details, including snapshots of key /proc files taken every minute before and during the bug hitting, see RHBZ https://bugzilla.redhat.com/show_bug.cgi?id=1075185

I have more of those snapshots saved and can provide them on request. I also have the matching /var/log/messages, which gives a lot of detail.

I have had that RHBZ open since March. Besides a couple of maybe-me-toos, no one has commented except Josh Boyer, who told me to quit PAE and use 64-bit (easier said than done for this computer, since I'd want a full 64-bit userland in that case). I also posted to the LKML, to the sound of crickets. Zack's Kernel News recently indicated PAE was kind of a pariah, perhaps not even having patches accepted for it anymore? Linus seems to hate it.

OK, fine, so I've resolved to take the day of pain and switch to a 64-bit reinstall, but I wanted to offer one last time the unique ability to provide any snapshots / data / logs required to help solve this bug for other people's sake. No one else seems able to reproduce the bug as "on demand" (nightly) as I can. I can also test any possibly-fixed kernels rapidly.

If no one cares, as Josh says, then fine, say so (or say nothing for about a week, after which I'll be on 64-bit and unable to help). Even better, if PAE is to be a pariah that is Officially Intentionally Buggy, then that should be announced to the world, so it can be deprecated or abolished and people won't get the impression it's a normal, supported variant like i386 and x86_64.

Note: I wanted to bisect on my own but (as per my RHBZ) I am a n00b who has no idea how to reconcile the "Fedora way" of building a kernel with the vanilla kernel git releases.
I found a web page explaining how to do it, but its instructions hit a Python bug only present on 32-bit (the details are also in the RHBZ)! Catch-22.

Thanks. I really hope I can be of further assistance in solving this before I give up after 6 months of buggy kernels.
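For anyone who wants to collect the same kind of per-minute /proc snapshots on their own machine, here is a minimal sketch of how it could be done. It is not the script used for the RHBZ attachments: the list of files, the function names, and the output layout are all assumptions. The `parse_meminfo` helper is only there because a 32-bit highmem kernel reports `LowTotal`/`LowFree` in /proc/meminfo, which may be worth watching if the PAE hunch is right.

```python
import os
import time

# Assumption: these are plausible files to capture; adjust to taste.
PROC_FILES = ["/proc/meminfo", "/proc/slabinfo", "/proc/buddyinfo", "/proc/vmstat"]


def parse_meminfo(text):
    """Return {field: number} for lines such as 'LowFree:   123456 kB'."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def snapshot(out_dir, proc_files=PROC_FILES):
    """Copy each readable file into a timestamped subdirectory of out_dir."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = os.path.join(out_dir, stamp)
    os.makedirs(target, exist_ok=True)
    for path in proc_files:
        try:
            with open(path) as src:
                data = src.read()
        except OSError:
            continue  # e.g. /proc/slabinfo is often readable by root only
        with open(os.path.join(target, os.path.basename(path)), "w") as dst:
            dst.write(data)
    return target
```

Calling `snapshot("/var/tmp/oom-snapshots")` (a hypothetical path) once a minute from cron would leave one directory per minute, so the last few before a hang survive on disk.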
Bug still exists as of 3.14.22-100.fc19. It crashed while I was using the machine at 5:25am during the nightly cron/find (I was awake). X got oom-killed first and I lost control of everything (black screen). Virtual console switching wouldn't work. Numlock was frozen on. The floppy disk light was frozen on(!!). However, the alt-sysrq keys I tried did get logged to disk! Even so, the alt-sysrq reboot attempt didn't work, as the system seems to have gone dead right before that (no more logs after 10 oom's in 30s).

I still haven't had time to switch to 64-bit, as I discovered all my "easy" (read: contrived and wacky) "upgrade" options were impossible with the recent Fedoras. If anyone wants to help debug this, I can still do it... The system still works 100% ok, bug-free, in 3.11.
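A rough sketch of how a figure like "10 oom's in 30s" could be pulled out of /var/log/messages, in case it helps others compare their logs. It assumes the classic syslog prefix ("Mon DD HH:MM:SS host kernel: ...") and the kernel's "invoked oom-killer" message; the function names are made up for illustration. Syslog timestamps carry no year, so a fixed placeholder year is used and all lines are assumed to come from the same year.

```python
import re
from datetime import datetime

# Assumed line shape, e.g.:
#   "Oct 30 05:25:01 host kernel: Xorg invoked oom-killer: gfp_mask=..."
OOM_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s.*?(?P<proc>\S+) invoked oom-killer"
)


def oom_events(lines):
    """Return [(timestamp, process_name)] for each oom-killer invocation."""
    events = []
    for line in lines:
        m = OOM_RE.match(line)
        if m:
            # Placeholder year 2000 (a leap year, so Feb 29 parses too).
            when = datetime.strptime("2000 " + m.group("ts"), "%Y %b %d %H:%M:%S")
            events.append((when, m.group("proc")))
    return events


def max_burst(events, window_s=30):
    """Largest number of events that fall within any window_s-second span."""
    times = sorted(t for t, _ in events)
    best = start = 0
    for end in range(len(times)):
        # Shrink the window from the left until it spans at most window_s.
        while (times[end] - times[start]).total_seconds() > window_s:
            start += 1
        best = max(best, end - start + 1)
    return best
```

Feeding it `open("/var/log/messages")` gives the invoking processes in order (X first, in the crash above) and the densest burst of invocations.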
Bug still exists as of 3.17.7-200.fc20.i686+PAE. I just upgraded to F20 and its newest kernel hoping it had been fixed, but within 24 hours the bug hit (at 5:22am again).

I also saw a precursor to the bug within an hour of booting F20, while doing a simple find . -newer on my home dir (83k files). I could tell the whole system bogged down; just switching windows, I could see the frames draw slowly until the find was done. In fact, it still seemed boggy for about 10-20s after the find was done. Weird.
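To put a number on that kind of slowdown, something like the following could time the same sort of tree walk on a good kernel (3.11) and a bad one. This is only a sketch that approximates `find . -newer REF`. It is not the command I ran, it does not report the starting directory itself, and the function names are invented for the example.

```python
import os
import time


def find_newer(root, reference):
    """Yield paths under root modified more recently than the reference file."""
    ref_mtime = os.lstat(reference).st_mtime
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime > ref_mtime:
                    yield path
            except OSError:
                continue  # entry vanished or is unreadable


def timed_find(root, reference):
    """Return (number of newer entries, elapsed seconds) for one walk."""
    start = time.monotonic()
    count = sum(1 for _ in find_newer(root, reference))
    return count, time.monotonic() - start
```

Comparing the elapsed time for the same directory and reference file across kernels would at least show whether the walk itself is slower, or only the rest of the system while it runs.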
Created attachment 162391: oom data from /v/l/messages