Bug 110031
Summary: | Large files writes and reads to/from disk slowed down to a crawl | | |
---|---|---|---|
Product: | IO/Storage | Reporter: | reserv0 |
Component: | Other | Assignee: | io_other |
Status: | NEW --- | | |
Severity: | blocking | CC: | alkisg, fernando.filgueira, gurselm, hydrapolic, szg00000 |
Priority: | P1 | | |
Hardware: | Intel | | |
OS: | Linux | | |
Kernel Version: | All versions after and including v4.2.0 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: | system info | | |
Description
reserv0
2015-12-27 10:54:44 UTC
Created attachment 198341
system info
Kernel v4.4.0 is also affected by this show-stopper bug...

Here is a log of the 900 MB RAM-disk saving and restoring for today's reboots (I compiled v4.4 and rebooted several times to test it). Look at the last RAM-disk stop for v4.4.0: almost 8 minutes instead of a dozen seconds (of course, the RAM-disk contents were the same all along)!

2016-01-11 11:35:46 - Linux v4.1.15: Stopping ramdisk
2016-01-11 11:35:53 - Linux v4.1.15: Ramdisk stopped
2016-01-11 11:39:25 - Linux v4.4.0: Starting ramdisk
2016-01-11 11:39:38 - Linux v4.4.0: Ramdisk started
2016-01-11 11:51:25 - Linux v4.4.0: Stopping ramdisk
2016-01-11 11:51:39 - Linux v4.4.0: Ramdisk stopped
2016-01-11 11:52:20 - Linux v4.1.15: Starting ramdisk
2016-01-11 11:52:32 - Linux v4.1.15: Ramdisk started
2016-01-11 12:58:01 - Linux v4.1.15: Stopping ramdisk
2016-01-11 12:58:08 - Linux v4.1.15: Ramdisk stopped
2016-01-11 12:58:51 - Linux v4.4.0: Starting ramdisk
2016-01-11 12:59:03 - Linux v4.4.0: Ramdisk started
2016-01-11 13:00:25 - Linux v4.4.0: Stopping ramdisk
2016-01-11 13:00:38 - Linux v4.4.0: Ramdisk stopped
2016-01-11 13:01:20 - Linux v4.4.0: Starting ramdisk
2016-01-11 13:01:32 - Linux v4.4.0: Ramdisk started
2016-01-11 13:37:23 - Linux v4.4.0: Stopping ramdisk
2016-01-11 13:37:31 - Linux v4.4.0: Ramdisk stopped
2016-01-11 13:38:15 - Linux v4.4.0: Starting ramdisk
2016-01-11 13:38:27 - Linux v4.4.0: Ramdisk started
2016-01-11 13:46:49 - Linux v4.4.0: Stopping ramdisk
2016-01-11 13:54:55 - Linux v4.4.0: Ramdisk stopped
2016-01-11 13:55:46 - Linux v4.1.15: Starting ramdisk
2016-01-11 13:55:58 - Linux v4.1.15: Ramdisk started

First, a question, if someone reads me here: is anyone reading the bug reports? I am *amazed* that I seem to be the only person in the world to care about such an issue, which should have jumped out at anyone running Linux...
o.o

Second, Linux v4.4.3 is still affected by this bug, despite the many fixes for locking issues that went into it. I mention those because I believe the problem stems from a deadlock somewhere in the I/O subsystem. I ran another experiment, a parallel compilation ('make -j6' on a quad-core) of a large piece of software: halfway through (I wouldn't be surprised if the bug happened only after 4 GB or so of data has been read from or written to the disks), the compilation slows down tremendously, the cores getting "stuck" one after the other (the Gnome system monitor reports them in "I/O latency", till only 25% of the processor full power gets stuck idling in I/O and almost 0% is left for "user" code execution).

I also ran the same test with the same kernel, compiled for 64 bits and running on a 64-bit distro, with the exact same result.

"First, a question, if someone reads me here: is anyone reading the bug reports ?"

***Yes. But your question is relevant. I tried the "Assigned To" e-mail address of this bug, and this comes back:

*******************************************************************************
I'm sorry to have to inform you that your message could not be delivered to one or more recipients. It's attached below.

For further assistance, please send mail to postmaster.

If you do so, please include this problem report. You can delete your own text from the attached returned message.

The mail system

<io_other@kernel-bugs.osdl.org>: unable to look up host kernel-bugs.osdl.org: hostname nor servname provided, or not known
*******************************************************************************

Kernel v4.4.4 is still affected by this bug.
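To put numbers on the ramdisk log quoted earlier in this thread, here is a minimal Python sketch that computes how long each "Stopping ramdisk" to "Ramdisk stopped" interval took. The timestamps are copied from that log; the function names are made up for illustration.

```python
from datetime import datetime

# Stop events copied from the log quoted above (all on 2016-01-11):
# (kernel, "Stopping ramdisk" time, "Ramdisk stopped" time).
STOPS = [
    ("v4.1.15", "11:35:46", "11:35:53"),
    ("v4.4.0", "11:51:25", "11:51:39"),
    ("v4.1.15", "12:58:01", "12:58:08"),
    ("v4.4.0", "13:00:25", "13:00:38"),
    ("v4.4.0", "13:37:23", "13:37:31"),
    ("v4.4.0", "13:46:49", "13:54:55"),
]


def seconds_between(start, end):
    """Return the number of seconds between two HH:MM:SS times on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


def stop_durations(stops=STOPS):
    """Return (kernel, seconds) for every ramdisk stop in the log."""
    return [(kernel, seconds_between(s, e)) for kernel, s, e in stops]


if __name__ == "__main__":
    for kernel, secs in stop_durations():
        print(f"{kernel}: ramdisk saved in {secs} s")
```

The last v4.4.0 stop comes out at 486 seconds (8 min 6 s), against 7 to 14 seconds for every other stop, including the earlier v4.4.0 ones. That is consistent with the slowdown appearing only after the system has done a fair amount of I/O.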
I did, however, try on a second machine (since it is a show-stopper bug and I have little free time for such experimentation, I didn't bother before): the bug is not triggered on that other computer (Q6600 CPU, running the same distro). It might therefore be a hardware-dependent bug (CPU? Motherboard?).

Kernel v4.4.6 is still affected by this bug.

Kernel v4.5 is also affected by this bug; I'm starting to despair of seeing it fixed... If only I could get a comprehensive log of the changes that went in between v4.1 and v4.2 (i.e. when that nasty bug was introduced), I could investigate, spot the potentially problematic commits and revert them one by one till I get a working kernel... Alas, unlike what happens for micro releases (e.g. ChangeLog-4.2.1), there doesn't seem to be a change log for minor release updates (which is quite surprising and a major flaw in the traceability of Linux kernel changes)... I did have a look at the commit stats between v4.1 and v4.2 (https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/diff/?id=v4.2&id2=v4.1&context=3&ignorews=1&dt=2), but that's quite a long list of changes, which are not grouped by commit and thus quite impractical. If anyone could kindly point me to what kind of change between v4.1 and v4.2 could induce such a bug, I'd be grateful and could probably help narrow down the issue, if not solve it...

Kernel v4.5.1 is still affected by this bug; v4.1.21 is the last working kernel for me...

v4.5.2 is still affected by this bug. :-(

v4.5.3 is still affected by this bug. Is anyone working on this, or willing to try and give me clues about what, in the v4.2 changes, could have broken I/O so badly?...

With v4.5.4, looking at the change log and noticing "writeback: Fix performance regression in wb_over_bg_thresh()", which looks really similar to what I'm experiencing here, I had hoped that the bug would finally be solved... Alas, v4.5.4 is still affected.

v4.6.0 is affected as well...
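On the missing per-commit change log between v4.1 and v4.2: with a local clone of the kernel git tree, `git log` accepts a revision range and an optional list of paths, which yields a commit-by-commit list. The Python wrapper below is only a sketch. The helper names are hypothetical, and limiting the search to `mm`, `block` and `fs` in the usage note is a guess at where an I/O regression might live, not something established in this report.

```python
import subprocess


def commits_between_cmd(old_tag, new_tag, paths=()):
    """Build a `git log` command listing commits in new_tag that are not in old_tag."""
    cmd = ["git", "log", "--oneline", "--no-merges", f"{old_tag}..{new_tag}"]
    if paths:
        # Everything after "--" is treated as a path filter by git.
        cmd += ["--", *paths]
    return cmd


def list_commits(repo_dir, old_tag, new_tag, paths=()):
    """Run the command inside a local kernel clone and return one line per commit."""
    result = subprocess.run(
        commits_between_cmd(old_tag, new_tag, paths),
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()
```

For example, `list_commits("linux", "v4.1", "v4.2", ["mm", "block", "fs"])` would list the candidate commits, assuming the clone lives in a directory named `linux`. Since the range still contains thousands of commits, `git bisect` between the two tags is probably a more practical way to find the culprit than reverting commits one by one.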
This issue still happens with Linux kernel v4.7.3. Note however that I have since installed another, newer Linux distribution (same lineage, same compiler version, but newer glibc: v2.20 instead of v2.13) on the affected computer, with pretty much the exact same configuration (in particular, same vanilla kernel and kernel config options). Under this new distro, I don't encounter the bug any more. The bug might therefore get triggered only on a particular combination of hardware and glibc version. Probably some deadlock issue somewhere in the semaphores... If only there were a comprehensive list of kernel changes (between v4.1 and v4.2), a commit-by-commit reversal would allow pinpointing the culprit change...

Interesting news about this bug: Linux v4.8.4 fixed the issue for 64-bit kernels... But *not* for 32-bit (+PAE) ones (same system, Rosa 2012 in 32 and 64 bits, and same kernel configuration in both cases). If anyone knows which patch that went into v4.8.4 may affect 32-bit and 64-bit kernels differently, please let me know, and I'll revert the said patch on the 64-bit system to confirm what fixed the issue (and what is left to be done for 32 bits).

Bug still present, for 32-bit kernels only, in v4.9.2. In case it is lock contention in the I/O scheduler, I also tried changing it from cfq to deadline, noop and even bfq, to no avail.

Please try to reproduce this bug with the latest kernel image.

Bug still present for 32-bit kernels in v4.10.3. I noticed that, when I do a large software compilation, at some point (after several gigabytes of disk writes), the "I/O latency" (as reported by the Gnome system monitor applet) skyrockets, eating up all the CPU core power. If I pause the compilation (CTRL S in the terminal), then the disk writes slowly proceed and the I/O CPU consumption goes down. If I wait for all writes to finish, I can un-pause the compilation process (which will slow down to a crawl again a minute later, but will be OK-ish till then)...
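One way to watch this while the compilation stalls is to sample the write-back counters in /proc/meminfo. The sketch below is an illustration only: the choice of fields is an assumption, and as far as I know `LowFree` and `HighFree` are reported only by kernels built with highmem support (such as 32-bit PAE ones), so they are simply skipped when absent.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


# Fields assumed to be relevant to a write-back stall; adjust as needed.
WATCHED = ("Dirty", "Writeback", "LowFree", "HighFree")


def writeback_snapshot(text, watched=WATCHED):
    """Keep only the watched fields that are actually present in the text."""
    info = parse_meminfo(text)
    return {key: info[key] for key in watched if key in info}


def read_snapshot(path="/proc/meminfo"):
    """Take one snapshot from the running system (Linux only)."""
    with open(path) as handle:
        return writeback_snapshot(handle.read())
```

Calling `read_snapshot()` every few seconds during the build, before and after pausing it, would show whether dirty pages pile up and drain slowly, and whether low memory runs short at the moment the I/O latency climbs.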
Perhaps something to do with changes in the disk write caching code? The affected system has 32 GB of RAM (and a 32-bit kernel with PAE).

Bug still present for 32-bit kernels in v4.11.2... We are getting closer to the EOL of the last working kernel version for me: v4.1... Someone, please, look at this show-stopper bug...

reserv0, I think the bug you're describing is the same as the one I filed against Ubuntu's kernel a few weeks ago: https://bugs.launchpad.net/ubuntu/+source/linux-hwe/+bug/1698118

I.e. disk writes start fast but end up to 800 times slower, in all 4.x kernels, on i386 systems with 16 GB of RAM or more. 3.x kernels don't have that issue on the same systems, and all 64-bit installations on the same systems are unaffected too.

I have a very easily reproducible test case there, i.e. a script that copies /lib around 100 times. The first copy happens in 5 seconds and the 30th in 800+ seconds, when the pagecache gets full. echo 3 > /proc/sys/vm/drop_caches clears the performance issue for a few minutes.

Is this the same bug that you're experiencing? Can you reproduce my test case on your systems? I think that the component "IO Storage" might be wrong though; it might be more related to page caching.

Greetings Alkis. Glad to see I'm not the only one in the whole world to encounter this show-stopper bug on his system... I can indeed confirm that limiting the available RAM to 12 GB (via 'mem=12G' on the kernel command line) does prevent the problem from occurring. However, 'echo 3 >/proc/sys/vm/drop_caches' does not achieve anything on my system, and the disk writes are still dead slow after that. It definitely looks like an issue related to the use of free RAM for caching purposes... Now to find what broke it among the gazillion commits applied between 4.1 (last working kernel version) and 4.2.0 (first broken version)...

Bug still present for 32-bit kernels in v4.12...

Bug still present for 32-bit kernels in v4.13.2.

Bug still present for 32-bit kernels in v4.14.2.
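The test case described above (copying /lib repeatedly and timing each copy) can be approximated as follows. This is not the script attached to the Launchpad bug, only a minimal Python sketch of the same idea with a hypothetical function name. It keeps every copy on disk until the run ends, so it needs enough free space for all of them.

```python
import shutil
import tempfile
import time
from pathlib import Path


def timed_copies(src, rounds, workdir=None):
    """Copy the tree `src` `rounds` times and return the seconds taken by each copy."""
    timings = []
    # All copies live in one scratch directory that is removed when the run ends.
    with tempfile.TemporaryDirectory(dir=workdir) as scratch:
        for index in range(rounds):
            dest = Path(scratch) / f"copy{index:03d}"
            start = time.monotonic()
            # symlinks=True copies links as links instead of following them.
            shutil.copytree(src, dest, symlinks=True)
            timings.append(time.monotonic() - start)
    return timings
```

Running `timed_copies("/lib", 100, workdir="/some/disk")` (the working directory is a placeholder for a path on the disk under test) and printing the list should show whether later copies take far longer than the first ones, as reported above for 32-bit kernels with 16 GB of RAM or more.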
Bug still present for 32-bit kernels in v4.15.12.

Bug still present for 32-bit kernels in v4.16.3, and with the "end of life" of v4.1 getting close, I'm worried I will be left without any option to run a maintained Linux kernel on 32-bit machines with 16 GB of memory or more...

Bug still present for 32-bit kernels in v4.17.0, and with the "end of life" of v4.1, I will be left without any upgrade path... Pretty please, consider keeping v4.1 alive till this *showstopper* bug gets solved!

Bug still present for 32-bit kernels in v4.18.1, and now v4.1 (the last working Linux kernel for 32-bit machines with 16 GB or more of RAM) has gone unmaintained...

Had a similar problem (mine is amd64): I tried every kernel shipped with Ubuntu and Debian and had the problem, then I downloaded and compiled kernels (4.16, 5.1 and 5.4) from:

https://cdn.kernel.org/pub/linux/kernel/

and the problem went away.

(In reply to fernando from comment #28)
> Had similar problem (mine is amd64), tried every kernel shipped with ubuntu
> and debian and had the problem, then I downloaded and compiled any kernel
> (4.16 5.1 and 5.4) from:
>
> https://cdn.kernel.org/pub/linux/kernel/
>
> and the problem went away.

I'm afraid the problem is NOT solved. The last kernel version I tried (v5.4.4) did show an improvement (i.e. the problem surfaces later, after a larger amount of data has been written; e.g. in my test case compilation, it appears at 50% instead of at 25% of the total compilation), but the problem is still there, and only kernels v4.1.x and older are free of it. Note that the amount of RAM you are using also affects how fast the problem arises (or whether it arises at all). I'm using 32 GB here.