Created attachment 257635 [details]
Several bug dumps. Some are tainted but I've reproduced it without tainting too
I originally reported this in the RedHat BZ against Fedora 26 (#1473242) but was asked to report it upstream.
I originally noticed it when performing backups using rsync, but it seems to be related to heavy disk/memory I/O. Originally I was backing up to and from an ext4 file system on top of LUKS and LVM, but the problem also occurs with XFS and without LVM or LUKS. It also occurs when simply using cp. This has been happening for some time... perhaps since the 4.9 or 4.10 series of kernels under F25.
I have changed all sorts of BIOS settings, hard disks, disabled swap as well as performed firmware level tests on each HD. I've also used memtest which has shown no memory errors.
Created attachment 257637 [details]
Output of various hardware cmds, /proc/meminfo, dmidecode, dmesg etc
Is there anything else I can provide to assist with this bug?
Kernel is a bit old for us and it's unclear what changes RH have applied.
Do you know what line mm/page_alloc.c:1877 refers to in that kernel? In my vanilla 4.11 tree that line is blank.
Created attachment 257779 [details]
'Patched' page_alloc.c from RH kernel 4.11.10-300
Thanks Andrew. I've uploaded the corresponding page_alloc.c and the referenced line (1877) appears to be
VM_BUG_ON(page_zone(start_page) != page_zone(end_page));
If preferred I can see if I can try and reproduce the issue from the vanilla kernel F26 RPMs produced here?
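For readers unfamiliar with the check quoted above: VM_BUG_ON is a BUG_ON that is only compiled in when the kernel is built with CONFIG_DEBUG_VM, and this particular assertion fires when the first and last page of the range being processed do not belong to the same memory zone. The following is only a toy Python model of that condition, not kernel code. The zone names, the PFN boundaries, and the helpers `zone_of` and `same_zone` are hypothetical and chosen purely for illustration.

```python
from typing import NamedTuple, Optional


class Zone(NamedTuple):
    name: str
    start_pfn: int  # first page frame number in the zone (inclusive)
    end_pfn: int    # one past the last page frame number (exclusive)


# Hypothetical zone layout, for illustration only.
ZONES = [
    Zone("DMA", 0, 4096),
    Zone("DMA32", 4096, 1048576),
    Zone("Normal", 1048576, 2359296),
]


def zone_of(pfn: int) -> Optional[Zone]:
    """Return the toy zone containing pfn, or None if it falls in no zone."""
    for zone in ZONES:
        if zone.start_pfn <= pfn < zone.end_pfn:
            return zone
    return None


def same_zone(start_pfn: int, end_pfn: int) -> bool:
    """Toy analogue of page_zone(start_page) == page_zone(end_page)."""
    start_zone = zone_of(start_pfn)
    end_zone = zone_of(end_pfn)
    if start_zone is None or end_zone is None:
        return False
    return start_zone == end_zone


# A 512-page block that starts on the DMA32 boundary stays inside one zone.
print(same_zone(4096, 4096 + 511))
# A block that starts below the boundary straddles DMA and DMA32.
print(same_zone(4000, 4000 + 511))
```

In this model the second call corresponds to the situation the assertion is meant to catch: a block of pages whose two ends disagree about which zone they are in.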
Created attachment 257799 [details]
Oops trace - reproduced with kernel 4.12.4
Hopefully reproducing it with the vanilla 4.12.4 kernel helps. There are two traces in the file: the second one is tainted, but the first was produced without tainting the kernel.
I'm seeing this exact same bug on Fedora 24 4.11.12-100.fc24.x86_64, and the page_alloc.c line number is the same. It occurs soon after a nightly scan starts. A Windows PC is set up to index everything using smb via samba on the Linux box at 3am (don't laugh at me, it's not my box/choice!). The system crashes with this kernel bug every couple of weeks. So it's pretty much identical to yours (heavy I/O, etc).
Strangely, the same box didn't have this bug when we were using the i686 32-bit PAE kernel, but we recently "upgraded" the box to 64-bit x86_64 kernel and userspace, and then this bug cropped up.
The box is a normally reliable Xeon server with ECC RAM. Normally I would bisect here, but I can't bisect between 32/64 and I don't have a known "good" 64-bit version, so I'm a bit stymied. Ian, maybe you can bisect?
Ian: as for your taint, we were having problems with the e1000e driver complaining on this same box: search your logs for "Detected Hardware Unit Hang". That error was causing the kernel to taint. We solved it with some ethtool options, and now we're not tainted anymore when your bug hits. If you want I can direct you to the other bz for it. And since we mitigated the e1000e bug we seem to get hangs less often, but they are still present.
(In reply to Trevor Cordes from comment #8)
> Strangely, the same box didn't have this bug when we were using the i686
> 32-bit PAE kernel, but we recently "upgraded" the box to 64-bit x86_64
> kernel and userspace, and then this bug cropped up.
Out of curiosity, how much RAM do you have? I've noticed it's much more likely to Oops when running on a LUKS volume too.
> The box is a normally reliable Xeon server with ECC RAM. Normally I would
> bisect here, but I can't bisect between 32/64 and I don't have a known
> "good" 64-bit version, so I'm a bit stymied. Ian, maybe you can bisect?
Mine is otherwise reliable. Also a Xeon box, but without ECC RAM. I don't have a known good kernel, just anecdotal evidence of roughly when I figured out what was going on. Older kernels also have an InfiniBand bug that hit me, making them unusable for me too.
> Ian: as for your taint, we were having problems with the e1000e driver
> complaining on this same box: search your logs for "Detected Hardware Unit
Thanks Trevor, the tainting is due to me running VirtualBox on the server, which uses out-of-tree kernel modules. VB is one of the main things I run on it, which is why some of the oopses are tainted, but I can also reproduce it without.
The box has 8GB RAM. No LUKS, but we are running md raid.
I can't reproduce the problem when writing to tape (LTO-3), having written many TBs so far without a problem. The issue appears to mainly affect heavy I/O being written to disk. Large files in particular.
Reproducible with vanilla kernel 4.12.7.
Reproducible with vanilla kernel 4.12.9 but at line 1866 this time.
I have so far been *unable* to reproduce the issue under 3.10.0-327 on EL 7.
Is there any specific debugging information I could present to assist with this?
After spending a bunch of time trying various kernels, here's what I've found:
The issue appears at 4.11.0
I cannot seem to reproduce it under 4.10.17
Could any of these be at fault?
Trevor are you able to confirm any of this?
I have seen the same (as far as I can gather) twice this morning on CentOS 7 running their current kernel - 3.10.0-514.26.2.el7.x86_64.
The server is a Cisco UCS B200M4 (blade thing) with 128GB RAM working quite IO-heavily as a MySQL server with data on an ext4 filesystem on non-LVM SAN volumes through multipath.
I am trying to figure out if we have been hit by the same earlier, but can't say at the moment.
The page_alloc.c line number is different, but it matches almost the same code as Ian's in comment #5, with a BUG_ON instead of a VM_BUG_ON.
Both my stack traces are very similar to the XFS one from Ian's "Several bug dumps. .." attachment.
There seems to be a relevant discussion (Ubuntu, 3.13 kernel) in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1497428 especially from comment #15 on.
I'll upload 2 kernel BUG outputs, a file with dmidecode and things from proc, and the actual page_alloc.c for the kernel I am running.
Please let me know if I can provide any more useful debugging information. The server is in production, so there are some limitations as to what I can do with it test-wise.
Created attachment 258163 [details]
dmesg output 1
Created attachment 258165 [details]
dmesg output 2
Created attachment 258167 [details]
cpuinfo iomem meminfo dmidecode lspci dmesg_from_boot
Created attachment 258169 [details]
Centos 7 version of page_alloc.c
Sigurd, you might have a different bug, as 3.10 seems quite old to exhibit this newer/ish bug. Unless 3.10 had some patches from 4.x backported that are causing this. I could be wrong.
Ian, I can't speak to those commits and they're pretty greek to me when delving that deep into the internals. However, if you have a known good, you can now bisect. I would help bisect, but my problem box is very hard to reproduce the bug on... I can't do it on demand, I must wait 1-10 days for it to hit, and even then, I'd probably have to wait a month before I can confirm a "good".
Unless you have an idea of how I can trigger this bug on demand, you'll have to bisect. I can help you with the LKML side if you manage to bisect and you don't want to do it.
We can confirm we haven't seen the bug on 4.10 either, but that's useless as we never ran 4.10 in x86_64... so it might have been there. However, I'm inclined to believe what your results are because there's been a lot of bugs creeping into the kernel lately (since 4.8-ish). That's why I switched to 64-bit in the first place! To get away from all the PAE bugs.
I've spent more time trying to figure out why it's hitting so few people (which seems obvious or we'd have a hundred "me toos" here), and why it's only hitting this one box of my many. Since our problem is on high i/o samba access (us as server), I poked around smb.conf and noticed from ages ago we had some stale socket options that probably aren't relevant anymore. So we took them out and we'll see if that mitigates the bug. Could be a red herring (probably is), but worth a try, been 5 days without the bug so far. If you are doing high i/o over the lan, maybe worth a quick check if you're setting lowlevel socket options anywhere.
Issue persists under 4.13.12 yet 4.10.17 remains stable.
Created attachment 274169 [details]
Yet another S1200BTL/S1200BTL BUG
For the record, I believe that I've been seeing the same BUG running Fedora 27 kernel 4.14.16-300.fc27.x86_64 on the same "Hardware name: Intel Corporation S1200BTL/S1200BTL, BIOS S1200BT.86B.02.00.0041.120520121743 12/05/2012" as the reporter of this bug, although my hardware is using older firmware/BIOS. I have not yet tested the latest upstream linux-4.15.y nor downgrading to linux-4.10.y, but will give that a test next based on reports in this bug which note that "4.10.17 remains stable".
Created attachment 274171 [details]
Reproduced "kernel BUG at mm/page_alloc.c:1909!" using 4.16.0-0.rc1.git1.1.vanilla.knurd.1.fc27.x86_64
I used the Fedora Kernel Vanilla Repositories to install the latest 4.16.0-0.rc1.git1.1.vanilla.knurd.1.fc27.x86_64 kernel and the "kernel BUG at mm/page_alloc.c:1909!" still occurs. I will test 4.10.x next...
We still have crashes every 2-7 days when we have the job that does heavy I/O turned on. When it's off it's mostly stable.
We've tried replacing the PS twice, added a pure sine-wave UPS, and tested the RAM. Nothing helps.
Ours is an older BIOS:
We were going to try the newest, but it looks like it doesn't help at all from what the others are reporting.
See what you come up with George. I guess if no one else can I can start the laborious process of bisecting going back to 4.10. I'm shocked this bug is taking so long, and shocked more people aren't having the problem, as there are a lot of S1200B* boards out there.
For what it's worth, it got so intolerable for me that 6 weeks ago I replaced the motherboard. Everything else is identical: RAM, CPU, disks, OS, case etc. The only difference is the motherboard, which is now an ASRock Z77 Pro3. It has been rock solid since and has also been upgraded to F27 (4.14.16). Whether the bug ultimately lies in the kernel or the S1200BTx motherboards & firmware, I can't say for sure, but 4.11.0 seemed to introduce an incompatibility around memory management and that series of motherboard.
So far, after performing one "stress test", v4.10.17 seems stable. Other, later kernels, typically failed after one or two stress tests. I'll provide another update after further stress testing...
Ian: Z77 can't do ECC though... whole point of Xeon is to do ECC.
George: are you able to do a git bisect on vanilla? The box I'm seeing this on is not mine, it's just one I manage, and it's in production and net-facing. I can bisect but it's a real pain for my customer, and possible sec risk running with an older kernel for the weeks it will take.
If you have a sure-fire way to make it crash on demand, please share, as none of us have found it yet (I think). I still need to wait several nights for it to crash usually, after turning the I/O jobs back on. Never ever has crashed on demand, even when I manually run the same jobs. Weird.
Boy it would be nice to solve this bug! Glad to hear it's still likely not h/w at fault at least (in terms of something failing vs some incompatible new kernel code).
Trevor, it may be a while before I'm able to perform a `git bisect` since I use this system for daily work. I agree that it will be time consuming to `git bisect` this since it does take some time to reproduce the BUG. I have reliably (although somewhat randomly at the most inconvenient times) observed the BUG via two "stress tests": 1) `rsync` backup, and 2) `bitbake` of YoctoProject releases where the YoctoProject metadata, sources and build artifacts consume ~70GiB of disk space, so a lot of I/O is used when building these YoctoProjects. The BUG is not triggered every time, but after running a few `bitbake`s it inevitably triggers, which has made for some rather unproductive days for me. Since the system is now stable, it may be a while before I'm able to find enough free time to bisect it. This could still be a hardware/firmware problem though, e.g. I do notice the following kernel init ACPI "Firmware Bug":
[ 0.503456] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
It's possible that there is some ACPI related compatibility issue which is triggered after v4.10.17?
Trevor: Yes I'm aware. It accepts ECC RAM, it obviously just doesn't use the ECC feature, but I did not want to spend a fortune on a replacement motherboard to find the issue persisted. It was more to highlight what seems a very specific quirk with the S1200BT* motherboards.
Rsyncing a few hundred GB of data, especially to LUKS encrypted partitions, would reliably crash my system within an hour or two at most. But there was no specific thing I could find to crash the box instantly.
OK, I have started the bisect. We dialed up the PC/smb scanner that causes the crash for us to run once every hour instead of once every day. It's going to be slow going. The first crash took 3 days. We need at least 9 days to assume a "good", as that is the longest the box seems to run for before crashing.
RHBZ is asking if we can reproduce the bug in 4.15. Can someone here confirm the bug exists in 4.15 and update the RHBZ, or tell me and I'll do it? I'm in the middle of the bisect and don't want to rock the boat by trying 4.15. I doubt it fixes it, but a confirmation would be nice.
13 more steps for the bisect; at ~3-9 days per test, that is ~78 days if it averages out. I'll see about trying some of the trigger ideas listed in these bugs to speed things up.
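The ~78 day figure follows from taking the midpoint of the 3-9 day range (6 days) for each of the 13 remaining steps. A minimal sketch of that arithmetic, assuming every step takes the same time and using a hypothetical helper name `estimate_days`:

```python
def estimate_days(steps, min_days, max_days):
    """Return (best, midpoint, worst) total days for the remaining bisect steps.

    Assumes every step costs the same number of days, which is a
    simplification: a "bad" result can show up early, while a "good"
    result needs the full waiting period.
    """
    midpoint = (min_days + max_days) / 2
    return steps * min_days, steps * midpoint, steps * max_days


best, typical, worst = estimate_days(13, 3, 9)
print(best, typical, worst)  # 39 78.0 117
```

So the remaining bisect would take somewhere between 39 and 117 days under these assumptions, with 78 days as the midpoint.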
For what it's worth, I've recently found a relatively trivial test case for triggering the bug. If you have a path which contains a lot of subdirectories and files, simply using `du -sh /path/` will (sometimes) trigger the BUG. I came across this case when trying to determine how much disk space was used for build artifacts, ~1TiB and ~11M inodes, which I save on disk under a scratch build directory. Simply attempting to determine how much space was used via `du` triggered the BUG. Alas, it doesn't appear to be a consistent trigger. :(
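For anyone who wants to script this kind of metadata-heavy walk, the following is a rough Python sketch of what `du -s` does: it stats every entry under a directory and sums the allocated blocks. It is not a guaranteed reproducer (the comment above notes the trigger is inconsistent), the function name `disk_usage` is hypothetical, and it relies on `st_blocks`, which is available on Unix-like systems.

```python
import os


def disk_usage(root):
    """Walk root like `du -s` and return (allocated_bytes, entries_seen).

    Every file and directory below root is lstat()ed once; hard links are
    counted a single time. The root directory itself is not counted.
    """
    total_bytes = 0
    entries = 0
    seen = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                # Entry vanished or is unreadable; skip it.
                continue
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue
            seen.add(key)
            entries += 1
            # st_blocks is reported in 512-byte units.
            total_bytes += st.st_blocks * 512
    return total_bytes, entries
```

Calling `disk_usage("/path/")` on a tree with millions of inodes generates the same kind of inode and dentry traffic as the `du` run described above.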
Trevor, FYI, I replied to the RHBZ noting that I was able to reproduce this issue when using the latest Fedora 27 kernel.
I personally haven't been able to reproduce this issue. However, one of our customers confirmed that these patches correct the problem for them:
That LKML thread does sound promising. It posits that b92df1de5d289c0b5d653e72414bf0850b8511e0 is the problem. See also c9a1b80daeb50e39f21fd7e62f7224f317ac85f0 which says "Fixes: b92df1de5d28".
I do want to make a couple of brief points: b92df1de does fall within the window we were all guessing of 4.10 to 4.11. It falls somewhere around 4.10-rc8 from what I can gather from git log.
However, my bisect, with 8 steps left, has me on 4.10-rc5 and I'm pretty sure I've eliminated the stuff around rc8 in an earlier step! Though I do find git visualize sometimes orders things strangely. And there was one bisect skip I had to do on an uncompilable, maybe that messed me up.
I'll continue with my bisect just to see if it can confirm that b92df1de is my problem. I'll do that before I test this new patch. Might as well, having gone this far. Maybe someone else here can test the new patch on an up-to-date kernel and report back? Thanks for the tip, Chris, and for tying that LKML thread/patch to our bug here.
Oh ya, what hardware are your customers using, Chris? The LKML makes it sound like this can affect myriad h/w and not just S1200Bxx? I'm surprised, as if that's the case, why isn't there more interest in this bug? There still must be something relatively unique to our setups.
I continue with my bisect. I wanted to update, I currently have in my bisect: bad b686784d03e6, good 1ee18329fae9. I double-checked and that excludes b92df1de5d28, which is Feb 22, 2017, and my current bisect bad is Feb 11, 2017. I'm pretty certain of my "bad", which would mean b92df1de5d28 is not my problem, and may not solve this bug (or we have >1 bug, or I messed up the bisect). A confirmation from one of the other guys here applying the LKML patch would be useful.
I'll keep going. Though it looks like now it'll be a bunch of goods until the bisect creeps closer to the Feb 2017 time. And goods are the worst, as they take 10+ days to verify, even with every trick I've come up with from tips here. That means it could still be 1-3 months.
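Comparing commit dates is a reasonable first check, but dates do not always follow ancestry in a history with merges, so a more direct test is whether the suspect commit is reachable from the current "bad" but not from the current "good". The sketch below is one way to do that, assuming a local kernel git checkout. It uses `git merge-base --is-ancestor`, which exits 0 when the first commit is an ancestor of the second and 1 when it is not. The helper names `git_is_ancestor` and `in_bisect_window` are hypothetical.

```python
import subprocess


def git_is_ancestor(ancestor, descendant, repo="."):
    """Return True if `ancestor` is reachable from `descendant` in repo."""
    result = subprocess.run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", ancestor, descendant],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode not in (0, 1):
        # Any other exit code means git failed (e.g. unknown commit).
        raise RuntimeError("git merge-base failed for %s..%s" % (ancestor, descendant))
    return result.returncode == 0


def in_bisect_window(commit, good, bad, is_ancestor=git_is_ancestor):
    """Return True if commit is still a candidate between good and bad.

    A commit is a candidate when it is contained in `bad` but not already
    contained in `good`. The is_ancestor argument can be replaced for testing.
    """
    return is_ancestor(commit, bad) and not is_ancestor(commit, good)
```

With the hashes from the comment above, the call would be `in_bisect_window("b92df1de5d28", "1ee18329fae9", "b686784d03e6")` run from inside the kernel tree. A result of False would agree with the date-based conclusion that the commit lies outside the remaining window.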
I used the Fedora Kernel Vanilla Repositories to install the 4.16.0-0.rc7.git1.1.vanilla.knurd.1.fc27.x86_64 kernel and can confirm that this BUG is now resolved as of v4.16-rc7. It looks like the following three commits are required for resolving this BUG:
379b03b7fa05 mm/memblock.c: hardcode the end_pfn being -1
864b75f9d6b0 mm/page_alloc: fix memmap_init_zone pageblock alignment
f59f1caf72ba Revert "mm: page_alloc: skip over regions of invalid pfns where possible"
Thanks to all for fixing this!
I can confirm that what George says is true and those patches appear to fix our problem. We've been running 4.15.15-200.fc26.x86_64 for 15 days with our normal load on (if not heavier) and no crashes. Normally it never lasts more than 9 days (average around 5).
As for my bisect, we reached a point where we had 4 "goods" in a row (a bit odd) and started thinking we had incorrectly labelled a "bad". Especially since the commits previously mentioned as the culprits were already outside our bisect window (just barely). So we prematurely quit the bisect and tested the newest patched kernel from Fedora instead. We have the bisect log and can always go back to the bisect later, but it appears to be unnecessary.
I would say this bz is closable, if the original poster wants to confirm. Then we can also close the RHBZ.
(Still no one has described why this bug seems to hit almost no one out there -- only certain specific hardware -- and I was curious for an explanation but the LKML kernel-ese doesn't really help answer that.)
I'm the original poster, so feel free to close it. I've long since replaced the motherboard that was affected by this bug as the system was next to useless. Awesome work to those that persisted.