How to reproduce: 1. Create jfs for loopback: mkfs.jfs /wheezy_fs 2. Mount it: mkdir /wheezy; mount -o loop /wheezy_fs /wheezy 3. Put debootstrapped wheezy there: debootstrap --arch=i386 wheezy /wheezy 4. Compile UML: cd $HOME/compile/linux-3.19.1/ make ARCH=um defconfig make ARCH=um menuconfig -- disable 64-bit kernel, enable static linking, disable loadable module support make ARCH=um -j4 5. Run UML: mkdir -p $HOME/tmp TMPDIR=$HOME/tmp ./linux rootfstype=hostfs rw init=/bin/bash rootflags=/wheezy/ 6. run ls -l ls -l At this point, if you attach a debugger, you'll see that it's endlessly looping in hostfs_readdir() by calling read_dir(). Adding printk() for *pos and ent->d_name in read_dir() shows that *pos doesn't advance beyond first entry and stays the same. Since I observe that read_dir() is used only in hostfs_readdir(), I removed calls to seekdir() and telldir(), this fixed this endless loop, through printk() I see that it gets full list of names there, but now hostfs_readdir() is being called in endless loop and eventually kernel panic happens. This is beyond my knowledge of linux kernel and I hope this little investigation is useful enough to find the root cause of the problem or at least find a way to work around that. This behaviour is observed only if host's fs is jfs -- ext4 and btrfs behave fine.
To speed up reproduction, linux config, minimal jfs filesystem image and linux binary are here -- https://hmage.net/files/bug94741.tar.gz (37.4MB)
Pushed onto my TODO! Thanks a lot for reporting!
(In reply to Richard Weinberger from comment #2) > Pushed onto my TODO! > Thanks a lot for reporting! BTW: Just gave it a try. I cannot reproduce the issue. I'm sure it depends on the host kernel. What is your version?
Host kernel is this: Linux vhmd 3.2.0-4-amd64 #1 SMP Debian 3.2.65-1+deb7u2 x86_64 GNU/Linux This is default kernel for debian wheezy.
Installed 3.19.1 on host machine, does not reproduce anymore. Is there a way to find a workaround for older kernels? I'm using UML as a virtualization to run debootstrap without root privileges on host machine. For same reason I can't change jfs to anything else.
I'm currently investigating what exactly is going on. Maybe a JFS bugfix needs backporting or UML's hostfs relies on a sideeffekt and need fixing.
Much appreciated. As a sidenote, not every UML user will be able to change host's kernel (one of the reasons why I'm using UML in first place), it would be more usable if changes inside UML were sufficient to fix the bug — ls -l on host's jfs works fine.
Created attachment 171231 [details] reproducer
Created attachment 171241 [details] ugly workaround
It is definitely a JFS issue, please see readdir.c. This small C program will reproduce the issue. Can you also please test the attached patch for UML? It should workaround the issue.
readdir.c reveals the bug with assert failure, as exepcted. Attached patch works around the bug, ls -l under UML now works.
Found the original JFS bug in question, it's bug 60737.
(In reply to Eugene Bujak from comment #12) > Found the original JFS bug in question, it's bug 60737. I saw that too. If I revert the said fix ontop of mainline its still works perfectly fine. I'm currently digging into the 3.2 stable tree. Maybe the #60737 got backported in a wrong manner...
(In reply to Richard Weinberger from comment #13) > (In reply to Eugene Bujak from comment #12) > > Found the original JFS bug in question, it's bug 60737. > > I saw that too. If I revert the said fix ontop of mainline its still works > perfectly fine. I'm currently digging into the 3.2 stable tree. Maybe the > #60737 got backported in a wrong manner... Indeed. Mainline commit 44512449c0ab368889dd13ae0031fba74ee7e1d2 backported to 3.2 is the root cause. But I fail to see why. I'll investigate tomorrow with a fresh brain, for now I need some sleep.
Richard, Thanks for bringing this to my attention. I found the problem. The position passed to the filldir() function needs to be incremented and was missed. In mainline, the dir_emit() function does not take a position argument, which is how we missed it.
Created attachment 171591 [details] jfs: fix readdir regression in stable trees I'll probably push the patch on Monday after I determine which stable trees need it.
Please note -- this bug is about UML's behaviour of ls -l being different under host vs UML. Reassigning this to JFS defeats the original bug report's intent. On host ls -l works fine, while under UML it does not, while using same host's jfs. It would be more logical that the buggy JFS behaviour would happen only if the user space program under UML did readdir/telldir/seekdir the way the attached reproducer does.
Alright. Take the bug back if you'd like to use it for a client-side workaround, althought the root of the problem is a bug in jfs. I'm still pushing the jfs fix to the stable kernel.
(In reply to Dave Kleikamp from comment #18) > Alright. Take the bug back if you'd like to use it for a client-side > workaround, althought the root of the problem is a bug in jfs. I'm still > pushing the jfs fix to the stable kernel. Yes, please push the JFS bugfix to stable. Working around the issue in UML is not the right solution.
Not trying to stop you from pushing the fix. Just don't want the original issue to be ignored. Old versions of host kernels will stay in the wild for a long while (hello android) while newer UML is never a problem.
UML does not work on ARM, so Android is not an issue. :) The original issue that JFS is buggy on stable 3.2 (not vanilla 3.2). User got the bug by updating kernel 3.2 to 3.2.x, they will get the fix by updating again.
Does that mean inability to update host kernel is not even considered?
If you're unable to update your kernel you're screwed in many other ways. What hinders you from running "apt-get update"?
(In reply to Richard Weinberger from comment #23) > If you're unable to update your kernel you're screwed in many other ways. > What hinders you from running "apt-get update"? Don't have root access at work -- I used virtual machine at home to reproduce the bug. I understand that my use case is an edge case at best -- I need to build, on gentoo, a custom debian bootstrap without having root while preserving all UID's/GID's and having ability to mount /proc -- UML is the best candidate for this because it lets me do whatever I want and then send the final tar.gz via hostfs. Maybe there's a cleaner way rather than posted workaround? For example, does the hostfs really need to call seekdir()/telldir() when the userspace program is just calling readdir()?
(In reply to Eugene Bujak from comment #24) > (In reply to Richard Weinberger from comment #23) > > If you're unable to update your kernel you're screwed in many other ways. > > What hinders you from running "apt-get update"? > > Don't have root access at work -- I used virtual machine at home to > reproduce the bug. And your admin never installs updates? But of course you can apply my UML workaround patch before building UML. Such that you can have a "works at work UML". :) > I understand that my use case is an edge case at best -- I need to build, on > gentoo, a custom debian bootstrap without having root while preserving all > UID's/GID's and having ability to mount /proc -- UML is the best candidate > for this because it lets me do whatever I want and then send the final > tar.gz via hostfs. > > Maybe there's a cleaner way rather than posted workaround? For example, does > the hostfs really need to call seekdir()/telldir() when the userspace > program is just calling readdir()? No. UML has to seekdir(). It has to honour the VFS directory cursor. I understand that pushing my workaround to mainline would help you in your specific use case. But I have to maintain UML for everyone. Pushing workarounds instead real fixes is not the right approach.
BTW: UML is only one victim of this JFS issue. It is a very subtle issue and not obvious to identify. I'm sure other programs will sooner or later also hit it.
(In reply to Richard Weinberger from comment #25) > I understand that pushing my workaround to mainline would help you in your > specific use case. But I have to maintain UML for everyone. Pushing > workarounds instead real fixes is not the right approach. I understand that perfectly, never treated posted patch as something worthy of pushing. > No. UML has to seekdir(). It has to honour the VFS directory cursor. That's where lack of knowledge on linux internals fails me -- I see that hostfs_readdir() is calling read_dir() in the loop while saving the position into the context, but the loop isn't broken anywhere else, and read_dir() is called only from one place, so the saving could be postponed after the loop is done? Unless dir_emit() returns false before we reach the end of the host directory and have to continue later.
The VFS calls hostfs_readdir() with a directory context. The interesting line is: next = ctx->pos; VFS is allowed to call hostfs_readdir() with different offsets, therefore UML has to seekdir() to ctx->pos before calling readdir() on the hostside.
Oh, I see. I'm out of ideas, then. The workaround is definitely not production quality and would raise more questions from anyone who's not familiar with the issue. I'll have to keep it with me until the host kernel is updated in my own case. Is there some way to let other people know about this when they stumble into same situation while using UML? I'm still uncertain about the need to call telldir()/seekdir() on every entry, but changing that would be hiding the issue into even more rare situation rather than fixing it, since nothing is stopping VFS from calling with problematic directory entry as a parameter. Either way, thank you for looking into this bug report. I honestly expected it to be completely ignored and fall into oblivion when I was filing it.
(In reply to Eugene Bujak from comment #29) > Oh, I see. I'm out of ideas, then. > > The workaround is definitely not production quality and would raise more > questions from anyone who's not familiar with the issue. I'll have to keep > it with me until the host kernel is updated in my own case. > > Is there some way to let other people know about this when they stumble into > same situation while using UML? UML has a nice mailinglist. But I don't expect many people to hit that issue as you need a) a broken stable kernel and b) JFS. JFS has not that much users these days. > I'm still uncertain about the need to call telldir()/seekdir() on every > entry, but changing that would be hiding the issue into even more rare > situation rather than fixing it, since nothing is stopping VFS from calling > with problematic directory entry as a parameter. I'm sure we can optimize the hostfs readdir code. All we have to make sure it that the VFS directory cursor is honoured. Patches are welcome. :-) > Either way, thank you for looking into this bug report. I honestly expected > it to be completely ignored and fall into oblivion when I was filing it. Thanks for reporting.
Created attachment 171991 [details] reworked hostfs readdir Can you please give the following patch a try? It reduces the number of system calls in hostfs' readdir and as a side effect UML should no longer trip over the JFS issue.
Checked it, ls -l and mc work fine.
(In reply to Eugene Bujak from comment #32) > Checked it, ls -l and mc work fine. Okay. I'll queue this patch for v4.1.