Reporting here to track a regression apparently introduced when 9p switched to netfslib in commit 80105ed2fd2715fb09a8fdb0655a8bdc86c120db "9p: Use netfslib read/write_iter", which is being reported on the mailing list and in distros:

"Re: [PATCH v5 40/40] 9p: Use netfslib read/write_iter"
https://lore.kernel.org/all/Zj0ErxVBE3DYT2Ea@gpd/

"linux: regression in the 9p protocol in 6.8 breaks autopkgtest qemu jobs (affecting debci)"
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1072004

"autopkgtest-virt-qemu on noble images sometimes hangs doing copydown"
https://bugs.launchpad.net/ubuntu/+source/autopkgtest/+bug/2056461

I've been digging into the bug and collecting data so that 9p developers may better understand it. The host is Debian 12 Bookworm amd64, building the example Debian package "hello" and running autopkgtest in a QEMU virtual machine with Debian unstable (sid) and a local build of the latest mainline kernel (currently 4a4be1ad3a6). The bug was first reported against v6.8.

I was hoping to create a 'simple' reproducer without the complexity of autopkgtest, but first I need to understand what is going wrong. After 31 iterations I believe I may finally have found a clue.

It appears (in the guest OS, of course) that v9fs_file_read_iter() doesn't consume the complete content of the host file being shared via 9p. In the reproducer the host autopkgtest Python scripts control qemu-system-x86 and send commands to the guest OS for it to execute. These include dynamically writing and then executing a Python script named "eofcat" [0] that is used as a "funnel" for stdin [1]. "eofcat" spins on a non-blocking read of its fd 0 until a specific file exists or the number of bytes read from fd 0 is >= a limit value read from the file named by argv[1] (a simplified sketch of its loop is at the end of this comment).

By modifying the guest's mount -t 9p command [2] options to include "debug=0x08" and using "| ts %.T |" in logging pipelines to capture microsecond-accurate timestamps, I was able to identify the guest kernel log entries when autopkgtest 'hangs' (spins).

What I see is the host file "stdin" being redirected to "eofcat" fd 0, and the guest kernel shows a series of v9fs_file_read_iter(count=1000000, ...) calls with "offset" increasing each time: 0, 1000000, 2000000, 3000000, 4000000, 5000000, 5416960. At that point it starts spinning - repeatedly retrying a v9fs_file_read_iter(count=1000000, offset=5416960) and a single read of "stdin_eof" (to get the number of bytes to expect - the "limit" variable), which holds 8253440 - also the size of the "stdin" file on the host.

In the guest a stat or ls or tail of the shared "stdin" file will break the deadlock and operations continue. No 9p cache options are in use.

In following comments I'll attach run logs of the host autopkgtest and the guest kernel dmesg with microsecond-precision timestamps with mount option debug=0x0f ( P9_DEBUG_VFS | P9_DEBUG_9P | DEBUG_CURRENT | P9_DEBUG_ERROR ), noting the exact timestamp where the host hangs, which can then be found in the guest kernel log.

[0] https://sources.debian.org/src/autopkgtest/5.35/virt/autopkgtest-virt-qemu/?hl=320#L314
[1] https://sources.debian.org/src/autopkgtest/5.35/virt/autopkgtest-virt-qemu/?hl=492#L487
[2] https://sources.debian.org/src/autopkgtest/5.35/virt/autopkgtest-virt-qemu/#L297
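For anyone who doesn't want to chase the autopkgtest source links, the core of "eofcat" is roughly the following. This is a simplified sketch based on my description above, not the verbatim script; the read size and sleep interval match what I observe in the logs, but names and details are approximations:

    #!/usr/bin/python3
    # Simplified sketch of autopkgtest's "eofcat" helper: copy fd 0 to fd 1,
    # finishing only once the eof-file named in argv[1] exists and at least
    # as many bytes as it specifies have been read.
    import errno, fcntl, os, sys, time

    eof_file = sys.argv[1]
    fcntl.fcntl(0, fcntl.F_SETFL, fcntl.fcntl(0, fcntl.F_GETFL) | os.O_NONBLOCK)
    total = 0
    while True:
        try:
            block = os.read(0, 1000000)
        except OSError as e:
            if e.errno != errno.EAGAIN:
                raise
            block = b''
        if block:
            total += len(block)
            os.write(1, block)
        elif os.path.exists(eof_file):
            with open(eof_file) as f:
                limit = int(f.read().strip() or 0)
            if total >= limit:
                break
        time.sleep(0.05)

With fd 0 redirected from the shared "stdin" file, the loop keeps reading 0 bytes once the guest's idea of the file size stops growing, which is exactly the spinning seen in the kernel log.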
Interesting finding: with debug=0x0f the bug is not triggered - I have to reduce the debug flags to trigger it. Currently working up from P9_DEBUG_VFS.

With run #34 and debug=0x08 (P9_DEBUG_VFS) I captured logs (attached). Hang occurs in autopkgtest at timestamp 10:08:37.390685 (UTC+1). In the guest kernel log that maps to timestamp 09:08:37.408129.

The first read on fd 0 is:

09:08:37.481590 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 0

followed by subsequent reads:

09:08:37.517808 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 1000000
...
09:08:37.565864 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 2000000
...
09:08:37.609522 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 3000000
...
09:08:37.653503 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 4000000
...
09:08:37.693647 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 5000000
...
09:08:37.716325 v9fs_file_read_iter (1049): fid 6 count 1000000 offset 5275648

at which point it starts 'spinning' waiting for more data. On the host we see:

~/tmp/9pfs$ ls -latr autopkgtest-qemu.ngbyoymu/shared/job.mon5m14s/
total 8072
-rw-r--r-- 1 root root       0 2024-05-31 10:08 stdout
-rw-r--r-- 1 root root       0 2024-05-31 10:08 stderr
drwxrwxrwt 3 root root    4096 2024-05-31 10:08 ..
-rwxr-xr-x 1 root root 8253440 2024-05-31 10:08 stdin
-rw-r--r-- 1 root root       7 2024-05-31 10:08 stdin_eof
drwxr-xr-x 2 root root    4096 2024-05-31 10:08 .
Created attachment 306383 [details] #34 autopkgtest log
Created attachment 306384 [details] #34 kernel log, debug=0x08
Run #36 with debug=0x0c ( P9_DEBUG_VFS | P9_DEBUG_9P ). Hang occurs in autopkgtest at timestamp 10:56:44.002011 (UTC+1). In the guest kernel log that maps to timestamp 09:56:44.666943, where the first v9fs_file_read_iter(count=1000000, offset=0) occurs.

What is interesting is that the reads fetch slightly more data before spinning at:

09:56:56.732407 v9fs_file_read_iter (885): fid 6 count 1000000 offset 7331840

Do we have a timing/race condition here?
Created attachment 306386 [details] #36 autopkgtest log
Created attachment 306387 [details] #36 kernel log, debug=0x0c
Considering the (changed) source code of fs/9p/vfs_file.c::v9fs_file_read_iter() in commit 80105ed2fd2715fb09a8fdb0655a8bdc86c120db: prior to the commit there was a code path specifically for O_NONBLOCK (as well as for P9L_DIRECT). Now there is only a code path for P9L_DIRECT.

kernel.36.log shows that within this function the P9L_DIRECT path is taken, since the final "(cached)" debug message never appears:

	if (fid->mode & P9L_DIRECT)
		return netfs_unbuffered_read_iter(iocb, to);

	p9_debug(P9_DEBUG_VFS, "(cached)\n");
	return netfs_file_read_iter(iocb, to);
}

I'm not familiar enough with the netfs layer to understand whether O_NONBLOCK is being handled (correctly) there - if special handling is indeed required?
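For reference, my recollection of the pre-80105ed2 direct path is roughly the following; this is an approximation from memory rather than a verbatim quote of the removed code, but the point is that the O_NONBLOCK case issued a single p9_client_read_once() instead of looping until EOF:

	/* Approximate shape of v9fs_file_read_iter() before the netfslib
	 * conversion: the direct path chose between a single server read
	 * and a read-until-EOF loop based on O_NONBLOCK.
	 */
	if (!(fid->mode & P9L_DIRECT)) {
		p9_debug(P9_DEBUG_VFS, "(cached)\n");
		return generic_file_read_iter(iocb, to);
	}

	if (iocb->ki_filp->f_flags & O_NONBLOCK)
		ret = p9_client_read_once(fid, iocb->ki_pos, to, &err);
	else
		ret = p9_client_read(fid, iocb->ki_pos, to, &err);
	if (!ret)
		return err;

	iocb->ki_pos += ret;
	return ret;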
Update: 6.10-rc2 still fails the same way. I've been writing a (minimal!) reproducer script that uses the same mechanisms as autopkgtest to copy files across, but without any of the autopkgtest infrastructure - just host, qemu guest, and the shared 9P directory. Even with 2500 iterations of the tar-based copy of the "hello" package source and build product it has so far failed to reproduce. I intend to add lots of unique individual file copies, since autopkgtest does a lot of that prior to hitting the tar-based transfer hang.
Can you try turning on some netfs tracepoints?

echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_rreq/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_sreq/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_read/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_write/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_write_iter/enable

and then capturing the result?
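One way to capture the result, assuming tracefs is mounted at the /sys/kernel/debug/tracing location used above, is to stream trace_pipe to a file for the duration of the run, e.g.:

echo 1 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace_pipe > /tmp/netfs-trace.log &
# ... run the reproducer, then stop the capture:
kill %1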
Created attachment 306436 [details] Contains run #120 autopkgtest and kernel log files v6.10-rc2 netfs debug logging as requested in run #120. Attachment contains the autopkgtest and virtual machine kernel log files, timestamped to aid synchronisation between them (autopkgtest is UTC+1 though)
Created attachment 306437 [details] Test case framework

Although I've not yet managed to reproduce the bug with the "minimal" (ha!) reproducer script and tests, it is in a state now that others should be able to easily execute, add more tests to, and otherwise use to try to replicate the bug. Once extracted read ./bug218916-reproducer.bash or get it to explain how to use it with:

./bug218917-reproducer.bash

So far it has not reproduced the bug, but it is a framework and one can easily add additional test scripts (that are source-d) in ./bug218916.d/

Currently it has test_autopkgtest, which replicates the sequence of copy{down,up} operations autopkgtest does but without the autopkgtest infrastructure, and test_tar, which does the single copydown of a directory (which causes the transfer to be piped via "tar"). It doesn't execute quite as fast as native autopkgtest because it has to use a "read -t 0.5" timeout for extracting replies from the controlling ssh bash coproc. That was implemented to avoid repeatedly suffering the delay of creating an ssh session for every transfer operation.
Edit: disastrous typos in my last comment! The command name should, of course, be: ./bug218916-reproducer.bash
Created attachment 306439 [details] Contains run #121 autopkgtest, kernel, trace log files Realised I missed the trace log from run #120! This tar archive for run #121 contains it
For run #121 the reading of "stdin" by "eofcat" starts at dmesg timestamp 13:46:26, with v9fs_read_iter (760) at offset 0, and corresponds to 35.861430 netfs_read s=0 in the trace.
Testing different scenarios in the autopkgtest scripts, especially the "eofcat" Python script that reads from the 9P mount in the guest, I experimented with setting a larger time.sleep(0.2) instead of 0.05 and, separately, with removing the O_NONBLOCK flag from its stdin FD. Neither made any difference - the hang still occurs. I'm about to test with internal.h's CONFIG_NETFS_DEBUG enabled to capture more clues from the actual IO dispatch functions.
Created attachment 306443 [details] Contains run #132 netfs_debug autopkgtest, kernel, trace log files

dmesg.132.log now contains the additional netfs_debug messages. The read of the "stdin" file by "eofcat" starts at 10:23:43.
Created attachment 306444 [details] Fix option 1

At last I've found the problem (which is, as I thought a week ago, due to 'stat'), but it is quite subtle. With extra debugging enabled on the mount (debug=0x0f rather than debug=0x08) it works, so, staying with debug=0x08 and after adding additional debug messages in v9fs_stat2inode_dotl(), I can see that in the failing case there are 2 calls to that function immediately before the v9fs_file_read_iter() calls start. In those calls netfs->remote_i_size is updated from stat->st_size, and I see st_size has different (increasing) values but smaller than the actual file size. So, effectively, not all the data has arrived before the read, and once the read starts the size is no longer updated. With the extra debugging, the additional cumulative delay of writing extreme amounts of debug info to the kernel log gives enough time for the data to arrive, so that when v9fs_stat2inode_dotl() is called the final size is correct.

This possible fix updates stat if a read returns 0 bytes. Tested with 6.8.12 and 6.10-rc2 repeatedly (10+ iterations so far) and no failures with debug logging completely disabled.
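To sketch the idea in code: this is only an outline of the approach, not the attached patch - the exact placement and error handling are guesses, and it only covers the 9p2000.L getattr path, hence the .U follow-up below:

static ssize_t
v9fs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	struct p9_fid *fid = iocb->ki_filp->private_data;
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	if (!(fid->mode & P9L_DIRECT))
		return netfs_file_read_iter(iocb, to);

	ret = netfs_unbuffered_read_iter(iocb, to);
	if (ret == 0) {
		/* Empty read: the cached size may be stale, so re-fetch the
		 * attributes from the server and update the inode (and
		 * remote_i_size) before the caller retries.
		 */
		struct p9_stat_dotl *st =
			p9_client_getattr_dotl(fid, P9_STATS_BASIC);

		if (!IS_ERR(st)) {
			v9fs_stat2inode_dotl(st, inode, 0);
			kfree(st);
		}
	}
	return ret;
}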
Extract from the fix log with additional debugging enabled:

9pnet: -- v9fs_file_read_iter (889): fid 6 count 1000000 offset 5580800
[eofcat] ==> netfs_unbuffered_read_iter_locked()
[eofcat] ==> netfs_begin_read(R=111 552800-646a3f)
[eofcat] submit 552800 + 0 >= 552800
9pnet: -- v9fs_file_read_iter (889): iocb->ki_flags 0 netfs_unbuffered_read_iter()=0
9pnet: -- v9fs_file_read_iter (889): no data; refreshing inode
9pnet: -- v9fs_stat2inode_dotl (889): P9_STATS_BASIC=2047 V9FS_STAT2INODE_KEEP_ISIZE=0 P9_STATS_SIZE=512
9pnet: -- v9fs_stat2inode_dotl (889): P9_STATS_BASIC: remote_i_size=5580800 stat->st_size=8253440
Created attachment 306445 [details] Fix v2 (deals with .U as well as .L ) Revised patch that should handle .U as well as .L protocol. 10 iterations each of autopkgtests with v6.8.12 and v6.10-rc2 all successful.
Thanks TJ. I tried applying your patch and can confirm I am still seeing the same problem with autopkgtest on Ubuntu Noble:

...
qemu-system-x86_64: terminating on signal 15 from pid 5349 (/usr/bin/python3)
autopkgtest [17:52:50]: ERROR: testbed failure: sent `auxverb_debug_fail', got `timeout', expected `ok...'
@Kip can you provide a complete autopkgtest log? I use the following:

sudo TMPDIR=$PWD autopkgtest -ddd -U ./hello_2.10-3.dsc -- autopkgtest-virt-qemu --debug --show-boot ./sid.img |& ts %.T | /usr/bin/tee ./aptest.log
I'm having a bit of trouble getting your commands to work. I'm assuming the first two lines are one command wrapped, and the third line is its own? I'm also assuming I should also replace hello_2.10-3.dsc with hello_2.10-3build2.dsc on my system, and the sid image with autopkgtest-noble-amd64.img.

When I try to run the first command I see:

$ sudo TMPDIR=$PWD autopkgtest -ddd -U ./hello_2.10-3build2.dsc -- autopkgtest-virt-qemu --debug --s
autopkgtest: DBG: autopkgtest options: Namespace(override_control=None, only_tests=[], skip_tests=None, built_binaries=True, packages=['./hello_2.10-3build2.dsc'], output_dir=None, logfile=None, summary=None, verbosity=2, setup_commands=['(O=$(bash -o pipefail -ec \'apt-get update | tee /proc/self/fd/2\') ||{ [ "${O%404*Not Found*}" = "$O" ] || exit 100; sleep 15; apt-get update; } || { sleep 60; apt-get update; } || false) && $(command -v eatmydata || true) apt-get dist-upgrade -y -o Dpkg::Options::="--force-confnew" && $(command -v eatmydata || true) apt-get --purge autoremove -y'], setup_commands_boot=[], add_apt_sources=[], add_apt_releases=[], pin_packages=[], apt_pocket=[], apt_default_release=None, enable_apt_fallback=True, copy=[], env=[], ignore_restrictions=[], user=None, gainroot=None, shell_fail=False, shell=False, timeout=0, timeout_short=None, timeout_copy=None, timeout_install=None, timeout_test=None, timeout_build=None, timeout_factor=1.0, set_lang=None, auto_control=True, build_parallel=None, needs_internet='run', validate=False)
autopkgtest: DBG: virt-runner arguments: ['autopkgtest-virt-qemu', '--debug', '--s']
autopkgtest: DBG: actions: [('source', './hello_2.10-3build2.dsc', True)]
autopkgtest: DBG: build binaries: True
autopkgtest: DBG: testbed init
autopkgtest [10:33:25]: starting date and time: 2024-06-14 10:33:25-0700
autopkgtest [10:33:25]: version 5.35
autopkgtest [10:33:25]: host kip-desktop; command line: /usr/bin/autopkgtest -ddd -U ./hello_2.10-3build2.dsc -- autopkgtest-virt-qemu --debug --s
usage: autopkgtest-virt-qemu [-h] [--qemu-architecture QEMU_ARCHITECTURE] [--dpkg-architecture DPKG_ARCHITECTURE] [-q QEMU_COMMAND] [-o OVERLAY_DIR] [-u USER] [-p PASSWORD] [-c CPUS] [--ram-size RAM_SIZE] [--timeout-reboot SECONDS] [--show-boot] [--timeout-poweroff SECONDS] [-d] [--qemu-options QEMU_OPTIONS] [--baseimage] [--boot {auto,bios,efi,ieee1275,none}] [--efi] images [images ...]
autopkgtest-virt-qemu: error: the following arguments are required: images
autopkgtest: DBG: sending command to testbed: auxverb_debug_fail
autopkgtest: DBG: TestbedFailure cannot send to testbed: [Errno 32] Broken pipe
autopkgtest: DBG: testbed stop
autopkgtest: DBG: testbed close, scratch=None
autopkgtest: DBG: sending command to testbed: quit
autopkgtest: DBG: TestbedFailure cannot send to testbed: [Errno 32] Broken pipe
autopkgtest: DBG: testbed stop
autopkgtest: DBG: auxverb_debug_fail failed: cannot send to testbed: [Errno 32] Broken pipe
autopkgtest: DBG: TestbedFailure unexpected eof from the testbed
autopkgtest: DBG: testbed stop
autopkgtest [10:33:25]: ERROR: testbed failure: unexpected eof from the testbed
autopkgtest: DBG: testbed stop
Created attachment 306462 [details] Kip's log

Apologies, I misread your command line. I ran the following and this is my generated log.

$ sudo TMPDIR=$PWD autopkgtest -ddd -U ./hello_2.10-3build2.dsc -- qemu --debug --show-boot ~/Projects/sbuild/images/autopkgtest-noble-amd64.img |& ts %.T | /usr/bin/tee ./aptest.log
@Kip Thanks for the log. The kernel used there does not have my patch applied as far as I can tell - unless it is included in the Ubuntu kernel build already, which is unlikely given that kernel is dated May 20th:

12:09:49.192706 [ 0.000000] Linux version 6.8.0-35-generic (buildd@lcy02-amd64-020) (x86_64-linux-gnu-gcc-13 (Ubuntu 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.42) #35-Ubuntu SMP PREEMPT_DYNAMIC Mon May 20 15:51:52 UTC 2024 (Ubuntu 6.8.0-35.35-generic 6.8.4)

In my case I manually boot the image file with QEMU and install patched kernel builds into it (using ssh/scp/rsync), then do an `apt update; apt upgrade` to prevent autopkgtest doing that in just the (temporary) overlay image it creates for each run. Then I ensure the required GRUB menu entry will be used with 'grub-set-default' [0].

[0] grub-set-default "1>4" is the kind of syntax required; this would select the 5th "menuentry" from the 1st "submenu" (indexes begin at 0). The value is written into `/boot/grub/grubenv`, which is read early when GRUB loads the generated grub.cfg at boot time. To identify the correct value I do a manual count using this (one-line!) command:

awk '{id=gensub(/.*(gnulinux-[^'"'"']*).*/,"\\1","g")};/menuentry /{prefix="";if(submenu){prefix=submenu ">";}print("ENTRY \"" prefix menu++ "\"",id)} /^submenu/{print("SUBMENU",++submenu,id);menu_saved=menu;menu=0}/^}/ && submenu {--submenu;menu=menu_saved}' /boot/grub/grub.cfg
Ah, apologies TJ. I thought the patched kernel needed to be running on the host and not in the guest. I will try to build the kernel now and modify the noble image via QEMU. But it might be easier if you were able to create a noble amd64 image for me to try, because otherwise I might end up fiddling with grub incorrectly, etc., and not get the new kernel to be used.
TJ, I'm happy to say that the hello package's DEP-8 tests all appear to pass with your patched kernel. I'll just keep the .img around for my own packages' testing. Do you think your fix will make it into Ubuntu Noble in the near future?
I've tested this with a Debian unstable image and a patched 6.8.12 kernel and I'm also happy to say that it fixed the issue for me, too.
Created attachment 306479 [details] Kip's second test log

TJ, I may have spoken too soon. I've tried running it again on the hello source. My image uses your patched kernel via a PPA, with the upgrade performed via the --setup-commands:

https://launchpad.net/~kip/+archive/ubuntu/linux-kernel-6.8.0

I am still seeing the following:

ERROR: testbed failure: sent `auxverb_debug_fail', got `timeout', expected `ok...'

To rule out the possibility of an incorrectly set up image, would you be able to generate an Ubuntu Noble image for me to test? I've been getting the image via:

$ autopkgtest-buildvm-ubuntu-cloud --verbose -r noble -a amd64 --cloud-image-url http://cloud-images.ubuntu.com/daily/server

...followed by running the --setup-commands to a bash script containing:

$ add-apt-repository ppa:kip/linux-kernel-6.8.0
$ apt upgrade --no-install-suggests --no-install-recommends -y

I will attach my log.
(In reply to Kip Warner from comment #28)
> TJ, I may have spoke too soon. I've tried running it again on the hello
> source. My image uses your patched kernel via a PPA with the upgrade
> performed via the --setup-commands:
>
> https://launchpad.net/~kip/+archive/ubuntu/linux-kernel-6.8.0

I see no sign of the patch being applied in the build log; I see it in the source diff but it would not be applied at build-time since it is in the wrong place.

https://launchpad.net/~kip/+archive/ubuntu/linux-kernel-6.8.0/+files/linux_6.8.0-31.31_6.8.0-35.35ubuntu1+kip1.diff.gz

shows a patch file at:

--- linux-6.8.0.orig/patches/fix-218916.patch
+++ linux-6.8.0/patches/fix-218916.patch

but no sign of it being applied in the tree.

Ubuntu kernel development flow is different to Debian; it's more like mainline, with changes committed directly in the kernel tree. On upload a source diff is then generated between mainline and the Ubuntu tree to create the usual Debianised orig.tar.gz/diff.gz packages that the buildd's use.
I'm sorry TJ, but I'm not sure what to do. I've lost several days on this now and it's still not clear how to apply your patch to the Ubuntu source package. My understanding was that quilt(1) was the standard way to apply a downstream patch before the package is built. I'll keep fiddling with it in the mean time, but if you can provide any pointers that would be helpful.
This is what I'm doing:

$ apt source linux
$ cd linux-6.8.0
$ quilt new bug218926.patch
$ quilt add fs/9p/vfs_file.c
$ patch -p1 < ../bug218916_fix_option_02.patch
$ quilt refresh
$ dch --changelog debian.master/changelog -i
$ debuild -S -sa --lintian-opts --no-lintian
$ dput ppa:kip/linux-kernel-6.8.0 linux_6.8.0-36.36ubuntu1+kip2_source.changes

Before uploading I did try running dpkg-source -x on the generated source package in a temp directory and verified that your patch was applied correctly on the unpacked source tree. The package is building now in the PPA and I will let you know what happens.
Scratch that, it's not building. The kernel package is too large for it and there is a 6-hour interval for the builder's cron garbage collector to remove the previous build. TJ, are you able to create a PPA for Noble that I could test?
Created attachment 306506 [details] Kip's log of 27 June, 2024

I created a new PPA:

https://launchpad.net/~kip/+archive/ubuntu/kernel-fix-218916

The source package appears to have applied the patch during unpacking:

https://launchpad.net/~kip/+archive/ubuntu/kernel-fix-218916/+build/28611783/+files/buildlog_ubuntu-noble-amd64.linux_6.8.0-36.36ubuntu1+kip2_BUILDING.txt.gz

My testbed installed the new kernel upgrade, but same problem. The test bed times out. This is the log.
Created attachment 306510 [details] Kip's autopkgtest log of 28 June, 2024

Tried again with the 6.8.0-38 kernel. This time I didn't use quilt(1), but patched the source directly and committed into my local tree before building the source package, like so:

https://wiki.ubuntu.com/Kernel/Dev/KernelBugFixing

The source package was built like so:

https://launchpadlibrarian.net/737080400/buildlog_ubuntu-noble-amd64.linux_6.8.0-38.38ubuntu1+kip1_BUILDING.txt.gz

And the attached is the resulting autopkgtest log. Note the timeout is still occurring.
I've tested the command below both with an unpatched and a patched Debian 6.9.7 kernel. The test hung as expected with the unpatched kernel, and succeeded with the patched kernel. There was one gotcha here: I didn't bump my kernel version upon rebuild, so when the test started, the original kernel was installed again (as an upgrade), the VM rebooted, and the test hung again. I sidestepped this by reinstalling the rebuilt kernel and setting the hold flag (an example apt-mark invocation is at the end of this comment).

@Kip In the test log you most recently attached, the kernel version is Ubuntu 6.8.0-36.36-generic with build date June 10, so I don't think this is your rebuilt kernel?

Side note: the Ubuntu kernel's source package format is 1.0, not 3.0 (quilt), so that's why your patch was ignored in your previous build. See dpkg-source(1).
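For anyone hitting the same gotcha, the hold can be set with apt-mark; the package names below are placeholders for whatever the rebuilt .debs are actually called:

# Mark the locally rebuilt kernel packages so apt won't "upgrade" them back
# to the archive kernel during the test run:
$ sudo apt-mark hold linux-image-6.9.7-patched linux-headers-6.9.7-patched

# Verify which packages are held:
$ apt-mark showhold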
Forgot to say: the commands I tested were

$ apt-get source hello
$ autopkgtest -U hello*.dsc -- qemu ../unstable-amd64.img
@Kip I reviewed the diff and build log at

https://launchpad.net/~kip/+archive/ubuntu/kernel-fix-218916/+build/28612347

for 6.8.0-38.38ubuntu1+kip1 and again I do not see evidence of the patch applied when looking at the diff.

I attempted to do an "Ubuntu style" git-based build here but the tooling eventually defeated me due to hard-coding GCC v13 in the build scripts, where my linux-builder nspawn container (Debian Bookworm) uses GCC v12. The build instructions for Ubuntu are found at:

https://help.ubuntu.com/community/Kernel/Compile

But as I only build mainline kernels using the mainline build system I have neither time nor inclination to delve deeper into Ubuntu. I am sure, however, you're making a mistake at some point and the patch isn't being applied in the builder.
Created attachment 306513 [details] Kip's aptest.log

@TJ, the patch was applied to the source tree directly this time, not through quilt, so it shouldn't be seen being applied after the source package is unpacked. It's applied to the actual source tarball, if I did things correctly. As @Christian pointed out, the kernel binaries weren't being installed for 6.8.0-38. Apt saw the packages but didn't prompt for an upgrade. I added a setup script which manually installs them like so:

$ add-apt-repository ppa:kip/kernel-fix-218916
$ apt upgrade --no-install-suggests --no-install-recommends -y
$ apt install linux-headers-6.8.0-38 linux-headers-6.8.0-38-generic linux-image-unsigned-6.8.0-38-generic linux-modules-6.8.0-38-generic linux-modules-extra-6.8.0-38-generic

This is the command I'm using to start the test bed:

$ sudo TMPDIR=$PWD autopkgtest --apt-upgrade --setup-commands=/home/kip/Projects/sbuild/scripts/setup.sh -ddd -U ./hello_2.10-3build1.dsc -- qemu --qemu-options=-enable-kvm --cpus=8 --ram-size=8192 --debug --show-boot ~/Projects/sbuild/images/autopkgtest-noble-amd64.img |& ts %.T | /usr/bin/tee ./aptest.log

The packages do install, the system boots into my patched kernel, and the eof error is gone. However - and this may or may not be related to your patch - I am seeing another timeout, with network connectivity appearing to die, which chokes apt. Log attached.
In case it's any help, I tried removing --qemu-options=-enable-kvm --cpus=8 --ram-size=8192, but it made no difference. To obtain the base image, this is how I do it:

$ autopkgtest-buildvm-ubuntu-cloud --verbose -r noble -a amd64 --cloud-image-url http://cloud-images.ubuntu.com/daily/server
I've now tested the patch on Ubuntu and it fixes the issue there, too.

@Kip, I didn't save the history, but the kernel rebuild was something like the following:

$ apt-get source linux && cd linux-*
$ patch -p1 < bug218916_fix_option_02.patch
# Rebuild the source package so the patch is included in the diff
# (Don't know if this is necessary, it's been ages since I've built a 1.0 package)
$ dpkg-source -b .
# Build kernel
# dpkg-buildpackage

I served the contents of the build result directory via python3 -m http.server, booted the guest, then downloaded and installed the image and module .debs from within. It was necessary to place a hold on the packages, otherwise they would be "upgraded" to the original kernel when a test was run. See the gotcha mentioned earlier.

Can someone ping the mailing list again? I think this issue might have dropped off the radar.
Thanks @Christian. As I mentioned I did manage to get the patched kernel to install and boot in the test bed, but now networking appears to be dead. Log attached above.
@Kip
@Kip: the log shows you'd customised the base sbuild image in some way, which caused the initial apt update to remove a whole bunch of network-related packages:

12:17:40.941559 The following packages will be REMOVED:
@TJ, as far as I know, I haven't done anything to the base image. Can you try it on the noble image and check? You can use my PPA with your kernel fix in there.

$ autopkgtest-buildvm-ubuntu-cloud --verbose -r noble -a amd64 --cloud-image-url http://cloud-images.ubuntu.com/daily/server
> Can someone ping the mailing list again? I think this issue might have
> dropped off the radar.

I've now subscribed to this bz so no need to ping the list again (and David and Eric have been there from the start and should be getting the mails too... but at this point you won't get more people from the list anyway).

As I said on the list I don't think the fix is appropriate -- and I still haven't had the time to dig into this properly. My understanding so far is that the guest is reading a file that is being modified from the host, and then for some reason the file size isn't coherent internally and the read falls short despite userspace knowing the file is bigger. Is that correct?

We don't have cache invalidation in 9p, so for cache=loose/fscache we just don't support files being modified from the host and I'd understand this behaviour, but for cache=none (the default you're using here) we should just not limit the read sizes to whatever the inode's i_size is. So the fix really isn't to do a new stat after short reads, but to ignore the size attribute in the first place.

Since I haven't had time to check the traces properly I also obviously haven't had time to properly try to reproduce either, but given there is no cache invalidation I'd expect this shouldn't be difficult... I'd also expect the fix to not be too hard to do once we've been able to trace where the size gets cut off, which is probably easy enough with qemu/GDB or for someone familiar with the netfs read helpers.

So, I'm sorry to say this after a first fix has been made -- the analysis up till now already is a great help! -- but my free time has gotten very short and I'll try to get to it when possible; if someone can help with a more appropriate fix in that direction it'd be great. (Kernel work unfortunately isn't easy unless your work pays for it or you have a lot of free time, and both stopped being true for me a while ago...)

At a higher level, cache=loose/fscache could also perhaps use some invalidation logic like NFS has (iirc NFS v3 re-checks the ctime every 60s or so?), but that's not needed here; cache=none (and possibly writeback) should not trust server attributes at all and should not have this problem.
Ok, so what I was missing was how eofcat works. Let's simplify the reproducer for real. I assume this used to work on older kernels, can someone confirm?

0/ some random 9p mount in a VM

1/ create an empty file 'foo' (either host or VM, doesn't matter):
: > foo

2/ in the VM, keep reading it until content appears:
while true; do read content; if [ -n "$content" ]; then echo $content; break; fi; done < foo

3/ on the host, write some data:
echo foo >> foo

(A copy-paste version with concrete paths is at the end of this comment.)

This doesn't get any data in the VM, and I assume it used to work, and it'll likely work with your patch -- just doing a stat or anything in the VM unblocks it.

As said in the previous comment, read shouldn't assume the file size and that'll solve the problem. v9fs_file_read_iter() is called with an appropriate size... so is netfs_unbuffered_read_iter() and netfs_begin_read(), but the problem is that netfs_begin_read() skips the read altogether as it doesn't match its assumption for i_size. In unbuffered mode it shouldn't match against i_size and should just stop when the read returns 0.

The netfs read request was allocated with netfs_alloc_request(NETFS_DIO_READ) so at that point we still know it's uncached, but that information seems to be lost for netfs_begin_read() and I'm not sure what is best to check.

@David -- now we have a simpler reproducer & know exactly what's wrong, could you look at this this week?
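Copy-paste version of the reproducer above; the share and mount paths are only examples and need adjusting to the local setup:

# Host side: create the file in the exported directory
$ : > /srv/9p-share/foo

# Guest side: poll the file over the 9p mount (cache=none); on affected
# kernels this loop never terminates even after the host appends data
$ while true; do read content; if [ -n "$content" ]; then echo "$content"; break; fi; done < /mnt/9p/foo

# Host side: append some data
$ echo foo >> /srv/9p-share/foo

# Guest side (second shell): a stat of the file unblocks the loop above
$ stat /mnt/9p/foo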
BTW while I still have time to type things:

(In reply to TJ from comment #7)
> Considering the (changed) source-code of
> fs/9p/vfs_file.c::v9fs_file_read_iter() in commit
> 80105ed2fd2715fb09a8fdb0655a8bdc86c120db.
>
> Prior to the commit there was a code path specifically for O_NONBLOCK (as
> well as for P9L_DIRECT). Now there is only a code-path for P9L_DIRECT.
>
> kernel.36.log shows that within this function the P9L_DIRECT path is taken
> since the final "(cached)" debug message never appears:
>
>     if (fid->mode & P9L_DIRECT)
>         return netfs_unbuffered_read_iter(iocb, to);
>
>     p9_debug(P9_DEBUG_VFS, "(cached)\n");
>     return netfs_file_read_iter(iocb, to);
> }
>
> I'm not familiar enough with the netfs layer to understand if O_NONBLOCK is
> being handled (correctly) there - if special handling is indeed required?

That O_NONBLOCK handling was slightly different; with O_NONBLOCK we used to stop the read loop on the first server read result, even if it wasn't 0. Without O_NONBLOCK we used to loop until EOF (i.e. read returns 0).

For this particular case, the O_NONBLOCK handling doesn't matter at all (eofcat will loop anyway), but for synthetic servers that use files as pipes we might want the O_NONBLOCK handling back. OTOH, that made some programs like tar very slow because they set O_NONBLOCK on regular files when they don't need to; I kind of regret having accepted 52cbee2a5768 ("9p: read only once on O_NONBLOCK") now; we probably should have added a new way of setting it (e.g. at mount time). But right now there's no way of doing that, so Sergey will likely ask again when they upgrade their kernel... anyway, ETIMEDOUT.
Is there any chance that we can have 80105ed reverted until the correct cause has been identified and fixed?
So if O_NONBLOCK is set, you want a DIO read to stop after a single subrequest?
(In reply to Christian from comment #48)
> Is there any chance that we can have 80105ed reverted until the correct
> cause has been identified and fixed?

Well, did anyone try if that is possible (with a reasonable amount of work and churn) and makes things work again? FWIW, I just tried, a simple 'git revert' fails.
Does this fix it for you?

https://lore.kernel.org/r/4058793.1720703979@warthog.procyon.org.uk/

--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -77,6 +77,10 @@ static void v9fs_issue_read(struct netfs_io_subrequest *subreq)
 	 * cache won't be on server and is zeroes */
 	__set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
 
+	if ((fid->mode & P9L_DIRECT) &&
+	    test_bit(NETFS_RREQ_NONBLOCK, &rreq->flags))
+		set_bit(NETFS_RREQ_BLOCKED, &rreq->flags);
+
 	netfs_subreq_terminated(subreq, err ?: total, false);
 }
That won't be enough imo; there are two problems:
- we're truncating the read request to the cached i_size
- that odd O_NONBLOCK behaviour on regular files 9p used to have

In this particular reproducer O_NONBLOCK happened to be set (in eofcat) so that part of the discussion came up, but the problem here really is the truncated IO -- with cache=none, we shouldn't rely on prior knowledge at all, so if the IO asked for a 1000-byte read we should try to read 1000 bytes even if we previously had i_size=10 in the inode.

Since nothing else refreshes i_size here, just bailing out because of the O_NONBLOCK won't make the request proceed further.

@Christian at this point I feel that we have a clear enough understanding of the problem that it'll be faster to fix than to revert; I'm sorry about the whole thing that should have been much smoother but we're almost there, please hang on a little bit more.
Dominique seems to have been right; unfortunately this didn't fix it. I tested this with a Debian 6.8.9 kernel.

@Dominique: if that is simpler for you, of course. I only mentioned a revert because I feared that the issue might have gone stale and a revert would have been a quick fix, but I admittedly have no visibility on this beyond bugzilla here. I appreciate the effort in solving this.

@Thorsten: In the Launchpad bug report, someone had reverted successfully, but that was with an older kernel (6.7.x, I think).
Does the following help if added to the previous patch?

diff --git a/fs/netfs/io.c b/fs/netfs/io.c
index c93851b98368..9536b2904863 100644
--- a/fs/netfs/io.c
+++ b/fs/netfs/io.c
@@ -458,7 +458,8 @@ netfs_rreq_prepare_read(struct netfs_io_request *rreq,
 		if (subreq->len > ictx->zero_point - subreq->start)
 			subreq->len = ictx->zero_point - subreq->start;
 	}
-	if (subreq->len > rreq->i_size - subreq->start)
+	if (rreq->origin != NETFS_DIO_READ &&
+	    subreq->len > rreq->i_size - subreq->start)
 		subreq->len = rreq->i_size - subreq->start;
 	if (rreq->rsize && subreq->len > rreq->rsize)
 		subreq->len = rreq->rsize;
@@ -595,9 +596,6 @@ int netfs_begin_read(struct netfs_io_request *rreq, bool sync)
 	do {
 		_debug("submit %llx + %llx >= %llx",
 		       rreq->start, rreq->submitted, rreq->i_size);
-		if (rreq->origin == NETFS_DIO_READ &&
-		    rreq->start + rreq->submitted >= rreq->i_size)
-			break;
 		if (!netfs_rreq_submit_slice(rreq, &io_iter))
 			break;
 		if (test_bit(NETFS_RREQ_BLOCKED, &rreq->flags) &&
Thanks for the patch.

That appears to trust that the IO worked too much? I see the reads actually returning zero bytes (zero-filled data) all the time past the end of the file, despite the 9p rread returning 0.

I'll try to look at the code over the weekend.
Okay, that is because 9p sets SREQ_CLEAR_TAIL; adding this goes one step closer:

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index a97ceb105cd8..33087298e0be 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -75,7 +75,8 @@ static void v9fs_issue_read(struct netfs_io_subrequest *subreq)
 	/* if we just extended the file size, any portion not in
 	 * cache won't be on server and is zeroes */
-	__set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
+	if (subreq->rreq->origin != NETFS_DIO_READ)
+		__set_bit(NETFS_SREQ_CLEAR_TAIL, &subreq->flags);
 
 	netfs_subreq_terminated(subreq, err ?: total, false);
 }

Unfortunately at this point we get -ENODATA instead of read returning 0; so we just need to return 0 instead and that'll probably work™.

Ran out of time again, will try to continue when able...
Happy to test, but the earliest I can do so is Friday - if someone else wants to beat me to it.
Created attachment 306570 [details] tenative fix to allow reads past cached size for direct IO

Ok, so the attached patch (0001-WIP-tenative-fix-to-allow-reads-past-cached-size-for.patch) fixes my reproducer at least; I think it's "good enough" for testing. In practice it's probably missing some edge cases and I haven't tested it at all besides my reproducer, so here be dragons.

Next step is to make it a bit more correct and we're good to go; but I'm surprised this is specific to 9p, honestly... Just reading from an already open file in a loop would likely also not notice external appends (on another client) to the file - if it works, how does it work?
@Dominique: I've just completed 10 passes of the Debian autopkgtest without it failing so that patch is a candidate for the fix; I agree with you about how it works being a mystery!
(In reply to Dominique Martinet from comment #58)
> Created attachment 306570 [details]
> tenative fix to allow reads past cached size for direct IO

How does it fix the O_NONBLOCK behaviour? Don't you also need to set NETFS_RREQ_BLOCKED too at some point?
The O_NONBLOCK behaviour is unrelated to this bug and a separate regression; it's a 9p-specific quirk that can get fixed separately. We don't need it here, as the read should also stop when a read request to the server returns 0, which it will once the end of file is reached, regardless of O_NONBLOCK.

The O_NONBLOCK quirk can save a round-trip here when there is data to read (the read request will bail out immediately and not after the second zero read), but in practice it makes little difference for such regular files; it's really aimed at synthetic servers which "stream" data through a regular file, so a single read can block until there is data, and when that data comes the client can get it immediately... It's not even non-blocking behaviour, it's really ugly and makes tar very slow; so I'm hoping we can get away with not reintroducing that quirk, but I'll deal with it when we get there.
@Dominique: I've also completed a few test passes with your patch, without failures. Tested against two different packages with different characteristics (one with lots of transfers).
@Dominique, any news on this? (For those among us that don't monitor the mailing lists) Happy to run further tests, if needed.
Can we get this patch or some variation on it committed? The regression is affecting many Debian autopkgtest runs.
@TJ, I've tried your most recent patch on the 6.8.0-39 kernel and it does not apply:

$ apt source linux
$ ls
0001-WIP-tenative-fix-to-allow-reads-past-cached-size-for.patch  linux_6.8.0-39.39.diff.gz  linux_6.8.0.orig.tar.gz
linux-6.8.0  linux_6.8.0-39.39.dsc
$ cd linux-6.8.0/
$ patch -p1 --dry-run < ../0001-WIP-tenative-fix-to-allow-reads-past-cached-size-for.patch
checking file fs/9p/vfs_addr.c
Hunk #1 succeeded at 73 (offset -2 lines).
checking file fs/netfs/io.c
Hunk #1 succeeded at 523 (offset 137 lines).
Hunk #2 succeeded at 596 (offset 137 lines).
Hunk #3 FAILED at 597.
1 out of 3 hunks FAILED

Are you able to simply create a PPA for those of us on Ubuntu to try? I've been trying to build binaries from your patch since June unsuccessfully.
Looks like most recent kernel for Noble is now 6.8.0-40 as of this morning. Patch doesn't apply to it either:

$ patch -p1 --dry-run < ../0001-WIP-tenative-fix-to-allow-reads-past-cached-size-for.patch
checking file fs/9p/vfs_addr.c
Hunk #1 succeeded at 73 (offset -2 lines).
checking file fs/netfs/io.c
Hunk #1 succeeded at 523 (offset 137 lines).
Hunk #2 succeeded at 596 (offset 137 lines).
Hunk #3 FAILED at 597.
1 out of 3 hunks FAILED
I've stacked a copy of the patch on my netfs-fixes branch:

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-fixes

It builds and seems to work with the test in comment #46 - except that it causes the generic/210 xfstest to fail with afs.
Thanks @David, but unfortunately that patch doesn't apply cleanly either to 6.8.0-40:

$ patch -p1 --dry-run < ../david_howells_fix.patch
checking file fs/9p/vfs_addr.c
Hunk #1 succeeded at 73 (offset -2 lines).
checking file fs/netfs/io.c
Hunk #1 succeeded at 523 (offset -7 lines).
Hunk #2 succeeded at 596 (offset -7 lines).
Hunk #3 FAILED at 741.
1 out of 3 hunks FAILED
Here's a patch that works correctly with afs too:

https://lore.kernel.org/r/1229195.1723211769@warthog.procyon.org.uk/

The hunk that made setting NETFS_SREQ_CLEAR_TAIL conditional on the request not being NETFS_DIO_READ needed replicating to other filesystems.
Unfortunately that fails too:

$ patch -p1 --dry-run < ../2024-08-09\ -\ David\ Howells.patch
checking file fs/9p/vfs_addr.c
Hunk #1 succeeded at 73 (offset -2 lines).
checking file fs/afs/file.c
Hunk #1 succeeded at 243 (offset 1 line).
checking file fs/ceph/addr.c
checking file fs/netfs/io.c
Hunk #1 succeeded at 523 (offset -7 lines).
Hunk #2 succeeded at 595 (offset -7 lines).
Hunk #3 FAILED at 745.
1 out of 3 hunks FAILED
checking file fs/nfs/fscache.c
Hunk #1 succeeded at 361 (offset -2 lines).
checking file fs/smb/client/file.c
Hunk #1 FAILED at 217.
1 out of 1 hunk FAILED
(In reply to David Howells from comment #69)
> Here's a patch that works correctly with afs too:
>
> https://lore.kernel.org/r/1229195.1723211769@warthog.procyon.org.uk/
>
> The hunk that made setting NETFS_SREQ_CLEAR_TAIL on not NETFS_DIO_READ
> needed replicating to other filesystems.

Confirmed this solves the issue in mainline and v6.10.

(In reply to Kip Warner from comment #68)
> Thanks @David, but unfortunately that patch doesn't apply cleanly either to
> 6.8.0-40:

v6.8 is no longer supported; if Ubuntu wishes to use this fix the Canonical kernel team will need to adopt and modify the patch.
Hey TJ. That's a bit concerning since there are probably hundreds of thousands, if not millions, of users of Ubuntu Noble, which is the current LTS. As far as I know, CI testing is still broken unless we can get the 6.8.0-40 kernel fixed, which is the most current kernel in the repositories for the LTS. Is there any way you could upload your patched kernel to a PPA so that users can test and have a temporary workaround until Canonical's kernel team figures out what they're going to do? My CI has been broken for around two months now.
(In reply to Kip Warner from comment #72)
> Hey TJ. That's a bit concerning since there are probably hundreds of
> thousands, if not millions, of users of Ubuntu Noble which is the current
> LTS. As far as I know, CI testing is still broken unless we can get the
> 6.8.0.40 kernel fixed which is the most current kernel in the repositories
> for the LTS.

That's for the Canonical kernel team to sort out.

> Is there any way you could upload your patched kernel to a PPA so that users
> can test and have a temporary workaround until Canonical's kernel team
> figures out what they're going to do? My CI has been broken for around two
> months now.

I don't do anything with Ubuntu kernels; I work with mainline and only on Debian.
Thanks TJ. I'll take it up with the Canonical kernel team. Do you know roughly how long it will take for your patch to be merged?
The patch got in as e3786b29c54cdae3490b07180a54e2461f42144c ("9p: Fix DIO read through netfs") in v6.11-rc4. It has a Fixes tag, so Greg should pick it up for stable within ~2 weeks.

Unfortunately it's causing other problems for cifs, so I'm not sure how that'll end, and I'm not sure Canonical will move with that; but I was told off-list that they went the revert-netfs route internally, so the Ubuntu kernel should work as far as I'm aware. Either way I can't do anything for you here; as TJ said, please contact them directly.
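For anyone tracking when a tree has the fix: assuming a clone of the mainline or stable repository, git can report which tags already contain the commit, e.g.:

# List tags that include the fix (filter is just an example):
$ git tag --contains e3786b29c54cdae3490b07180a54e2461f42144c | grep v6.11
# Or inspect recent 9p changes in a given tag:
$ git log --oneline v6.11-rc4 -- fs/9p/vfs_addr.c | head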
Hello Dominique. I prepared an autopkgtest image like so, with the 6.11-rc4 kernel installed:

$ autopkgtest-buildvm-ubuntu-cloud --verbose -r noble -a amd64 --cloud-image-url http://cloud-images.ubuntu.com/daily/server --post-command "wget -c https://kernel.ubuntu.com/mainline/v6.11-rc4/amd64/linux-headers-6.11.0-061100rc4-generic_6.11.0-061100rc4.202408182119_amd64.deb https://kernel.ubuntu.com/mainline/v6.11-rc4/amd64/linux-headers-6.11.0-061100rc4_6.11.0-061100rc4.202408182119_all.deb https://kernel.ubuntu.com/mainline/v6.11-rc4/amd64/linux-image-unsigned-6.11.0-061100rc4-generic_6.11.0-061100rc4.202408182119_amd64.deb https://kernel.ubuntu.com/mainline/v6.11-rc4/amd64/linux-modules-6.11.0-061100rc4-generic_6.11.0-061100rc4.202408182119_amd64.deb && sudo dpkg -i *.deb"

That builds the image and then downloads the mainline kernel, with (unless I'm mistaken) the patch in there, from:

https://kernel.ubuntu.com/mainline/v6.11-rc4/

I then attempt to perform CI testing for the GNU hello package with the fresh test bed:

$ sudo TMPDIR=$PWD autopkgtest --apt-upgrade -ddd -U ./hello_2.10-3build1.dsc -- qemu --qemu-options=-enable-kvm --cpus=8 --ram-size=8192 --debug --show-boot autopkgtest-noble-amd64.img |& ts %.T | /usr/bin/tee ./aptest.log

Like with my comments in #38, I noticed that networking was initially dead - at least the first time I ran it. I was not running any special setup commands to install or remove any packages in the test bed, other than whatever the hello package is asking the package manager to pull. I would have attached my aptest.log again, but for good measure I wanted to replicate the same issue. So I ran the command again, and again, and again, and it seems to work now, clobbering the old log. Let's hope it stays that way and I'm not simply catching a race condition when it's in a good mood.