Bug 94741 - hostfs goes into endless loop in readdir/opendir if host fs is jfs
Summary: hostfs goes into endless loop in readdir/opendir if host fs is jfs
Status: ASSIGNED
Alias: None
Product: File System
Classification: Unclassified
Component: JFS (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Dave Kleikamp
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-03-11 13:32 UTC by Eugene Bujak
Modified: 2016-03-20 10:17 UTC (History)
5 users (show)

See Also:
Kernel Version: 3.19.1
Subsystem:
Regression: No
Bisected commit-id:


Attachments
reproducer (432 bytes, text/x-csrc)
2015-03-19 22:14 UTC, Richard Weinberger
Details
ugly workaround (404 bytes, patch)
2015-03-19 22:15 UTC, Richard Weinberger
Details | Diff
jfs: fix readdir regression in stable trees (1.23 KB, patch)
2015-03-21 21:55 UTC, Dave Kleikamp
Details | Diff
reworked hostfs readdir (2.55 KB, patch)
2015-03-24 14:51 UTC, Richard Weinberger
Details | Diff

Description Eugene Bujak 2015-03-11 13:32:04 UTC
How to reproduce:

1. Create jfs for loopback:

mkfs.jfs /wheezy_fs

2. Mount it:

mkdir /wheezy; mount -o loop /wheezy_fs /wheezy

3. Put debootstrapped wheezy there:

debootstrap --arch=i386 wheezy /wheezy

4. Compile UML:

cd $HOME/compile/linux-3.19.1/
make ARCH=um defconfig
make ARCH=um menuconfig -- disable 64-bit kernel, enable static linking, disable loadable module support
make ARCH=um -j4

5. Run UML:

mkdir -p $HOME/tmp
TMPDIR=$HOME/tmp ./linux rootfstype=hostfs rw init=/bin/bash rootflags=/wheezy/

6. run ls -l

ls -l

At this point, if you attach a debugger, you'll see that it's endlessly looping in hostfs_readdir() by calling read_dir().

Adding printk() for *pos and ent->d_name in read_dir() shows that *pos doesn't advance beyond first entry and stays the same.

Since I observe that read_dir() is used only in hostfs_readdir(), I removed calls to seekdir() and telldir(), this fixed this endless loop, through printk() I see that it gets full list of names there, but now hostfs_readdir() is being called in endless loop and eventually kernel panic happens.

This is beyond my knowledge of linux kernel and I hope this little investigation is useful enough to find the root cause of the problem or at least find a way to work around that.

This behaviour is observed only if host's fs is jfs -- ext4 and btrfs behave fine.
Comment 1 Eugene Bujak 2015-03-11 15:58:07 UTC
To speed up reproduction, linux config, minimal jfs filesystem image and linux binary are here -- https://hmage.net/files/bug94741.tar.gz (37.4MB)
Comment 2 Richard Weinberger 2015-03-19 18:27:22 UTC
Pushed onto my TODO!
Thanks a lot for reporting!
Comment 3 Richard Weinberger 2015-03-19 18:43:17 UTC
(In reply to Richard Weinberger from comment #2)
> Pushed onto my TODO!
> Thanks a lot for reporting!

BTW: Just gave it a try. I cannot reproduce the issue.
I'm sure it depends on the host kernel. What is your version?
Comment 4 Eugene Bujak 2015-03-19 19:01:05 UTC
Host kernel is this:

Linux vhmd 3.2.0-4-amd64 #1 SMP Debian 3.2.65-1+deb7u2 x86_64 GNU/Linux

This is default kernel for debian wheezy.
Comment 5 Eugene Bujak 2015-03-19 21:43:14 UTC
Installed 3.19.1 on host machine, does not reproduce anymore.

Is there a way to find a workaround for older kernels? I'm using UML as a virtualization to run debootstrap without root privileges on host machine. For same reason I can't change jfs to anything else.
Comment 6 Richard Weinberger 2015-03-19 21:44:39 UTC
I'm currently investigating what exactly is going on. Maybe a JFS bugfix needs backporting or UML's hostfs relies on a sideeffekt and need fixing.
Comment 7 Eugene Bujak 2015-03-19 21:53:16 UTC
Much appreciated.

As a sidenote, not every UML user will be able to change host's kernel (one of the reasons why I'm using UML in first place), it would be more usable if changes inside UML were sufficient to fix the bug — ls -l on host's jfs works fine.
Comment 8 Richard Weinberger 2015-03-19 22:14:41 UTC
Created attachment 171231 [details]
reproducer
Comment 9 Richard Weinberger 2015-03-19 22:15:01 UTC
Created attachment 171241 [details]
ugly workaround
Comment 10 Richard Weinberger 2015-03-19 22:15:09 UTC
It is definitely a JFS issue, please see readdir.c. This small C program will reproduce the issue.
Can you also please test the attached patch for UML? It should workaround the issue.
Comment 11 Eugene Bujak 2015-03-19 22:32:26 UTC
readdir.c reveals the bug with assert failure, as exepcted.

Attached patch works around the bug, ls -l under UML now works.
Comment 12 Eugene Bujak 2015-03-19 22:39:17 UTC
Found the original JFS bug in question, it's bug 60737.
Comment 13 Richard Weinberger 2015-03-19 22:41:45 UTC
(In reply to Eugene Bujak from comment #12)
> Found the original JFS bug in question, it's bug 60737.

I saw that too. If I revert the said fix ontop of mainline its still works perfectly fine. I'm currently digging into the 3.2 stable tree. Maybe the
#60737 got backported in a wrong manner...
Comment 14 Richard Weinberger 2015-03-19 22:55:55 UTC
(In reply to Richard Weinberger from comment #13)
> (In reply to Eugene Bujak from comment #12)
> > Found the original JFS bug in question, it's bug 60737.
> 
> I saw that too. If I revert the said fix ontop of mainline its still works
> perfectly fine. I'm currently digging into the 3.2 stable tree. Maybe the
> #60737 got backported in a wrong manner...

Indeed. Mainline commit 44512449c0ab368889dd13ae0031fba74ee7e1d2 backported to 3.2 is the root cause.
But I fail to see why. I'll investigate tomorrow with a fresh brain, for now I need some sleep.
Comment 15 Dave Kleikamp 2015-03-21 21:52:12 UTC
Richard,
Thanks for bringing this to my attention.

I found the problem. The position passed to the filldir() function needs to be incremented and was missed. In mainline, the dir_emit() function does not take a position argument, which is how we missed it.
Comment 16 Dave Kleikamp 2015-03-21 21:55:31 UTC
Created attachment 171591 [details]
jfs: fix readdir regression in stable trees

I'll probably push the patch on Monday after I determine which stable trees need it.
Comment 17 Eugene Bujak 2015-03-23 21:06:17 UTC
Please note -- this bug is about UML's behaviour of ls -l being different under host vs UML. Reassigning this to JFS defeats the original bug report's intent.

On host ls -l works fine, while under UML it does not, while using same host's jfs. 

It would be more logical that the buggy JFS behaviour would happen only if the user space program under UML did readdir/telldir/seekdir the way the attached reproducer does.
Comment 18 Dave Kleikamp 2015-03-23 21:10:36 UTC
Alright. Take the bug back if you'd like to use it for a client-side workaround, althought the root of the problem is a bug in jfs. I'm still pushing the jfs fix to the stable kernel.
Comment 19 Richard Weinberger 2015-03-23 21:14:27 UTC
(In reply to Dave Kleikamp from comment #18)
> Alright. Take the bug back if you'd like to use it for a client-side
> workaround, althought the root of the problem is a bug in jfs. I'm still
> pushing the jfs fix to the stable kernel.

Yes, please push the JFS bugfix to stable. Working around the issue in UML is
not the right solution.
Comment 20 Eugene Bujak 2015-03-23 21:21:02 UTC
Not trying to stop you from pushing the fix.

Just don't want the original issue to be ignored. Old versions of host kernels will stay in the wild for a long while (hello android) while newer UML is never a problem.
Comment 21 Richard Weinberger 2015-03-23 21:24:16 UTC
UML does not work on ARM, so Android is not an issue. :)

The original issue that JFS is buggy on stable 3.2 (not vanilla 3.2).
User got the bug by updating kernel 3.2 to 3.2.x, they will get the fix by updating again.
Comment 22 Eugene Bujak 2015-03-23 21:32:32 UTC
Does that mean inability to update host kernel is not even considered?
Comment 23 Richard Weinberger 2015-03-23 21:35:46 UTC
If you're unable to update your kernel you're screwed in many other ways.
What hinders you from running "apt-get update"?
Comment 24 Eugene Bujak 2015-03-23 22:06:44 UTC
(In reply to Richard Weinberger from comment #23)
> If you're unable to update your kernel you're screwed in many other ways.
> What hinders you from running "apt-get update"?

Don't have root access at work -- I used virtual machine at home to reproduce the bug.

I understand that my use case is an edge case at best -- I need to build, on gentoo, a custom debian bootstrap without having root while preserving all UID's/GID's and having ability to mount /proc -- UML is the best candidate for this because it lets me do whatever I want and then send the final tar.gz via hostfs.

Maybe there's a cleaner way rather than posted workaround? For example, does the hostfs really need to call seekdir()/telldir() when the userspace program is just calling readdir()?
Comment 25 Richard Weinberger 2015-03-23 22:13:12 UTC
(In reply to Eugene Bujak from comment #24)
> (In reply to Richard Weinberger from comment #23)
> > If you're unable to update your kernel you're screwed in many other ways.
> > What hinders you from running "apt-get update"?
> 
> Don't have root access at work -- I used virtual machine at home to
> reproduce the bug.

And your admin never installs updates?
But of course you can apply my UML workaround patch before building UML.
Such that you can have a "works at work UML". :)

> I understand that my use case is an edge case at best -- I need to build, on
> gentoo, a custom debian bootstrap without having root while preserving all
> UID's/GID's and having ability to mount /proc -- UML is the best candidate
> for this because it lets me do whatever I want and then send the final
> tar.gz via hostfs.
> 
> Maybe there's a cleaner way rather than posted workaround? For example, does
> the hostfs really need to call seekdir()/telldir() when the userspace
> program is just calling readdir()?

No. UML has to seekdir(). It has to honour the VFS directory cursor.

I understand that pushing my workaround to mainline would help you in your specific use case. But I have to maintain UML for everyone. Pushing workarounds instead
real fixes is not the right approach.
Comment 26 Richard Weinberger 2015-03-23 22:15:16 UTC
BTW: UML is only one victim of this JFS issue. It is a very subtle issue
and not obvious to identify. I'm sure other programs will sooner or later also hit it.
Comment 27 Eugene Bujak 2015-03-23 22:31:08 UTC
(In reply to Richard Weinberger from comment #25)
> I understand that pushing my workaround to mainline would help you in your
> specific use case. But I have to maintain UML for everyone. Pushing
> workarounds instead real fixes is not the right approach.

I understand that perfectly, never treated posted patch as something worthy of pushing.

> No. UML has to seekdir(). It has to honour the VFS directory cursor.

That's where lack of knowledge on linux internals fails me -- I see that hostfs_readdir() is calling read_dir() in the loop while saving the position into the context, but the loop isn't broken anywhere else, and read_dir() is called only from one place, so the saving could be postponed after the loop is done?

Unless dir_emit() returns false before we reach the end of the host directory and have to continue later.
Comment 28 Richard Weinberger 2015-03-23 22:35:54 UTC
The VFS calls hostfs_readdir() with a directory context. The interesting line is:
next = ctx->pos;
VFS is allowed to call hostfs_readdir() with different offsets, therefore UML has to seekdir() to ctx->pos before calling readdir() on the hostside.
Comment 29 Eugene Bujak 2015-03-23 22:47:53 UTC
Oh, I see. I'm out of ideas, then.

The workaround is definitely not production quality and would raise more questions from anyone who's not familiar with the issue. I'll have to keep it with me until the host kernel is updated in my own case. 

Is there some way to let other people know about this when they stumble into same situation while using UML?

I'm still uncertain about the need to call telldir()/seekdir() on every entry, but changing that would be hiding the issue into even more rare situation rather than fixing it, since nothing is stopping VFS from calling with problematic directory entry as a parameter.

Either way, thank you for looking into this bug report. I honestly expected it to be completely ignored and fall into oblivion when I was filing it.
Comment 30 Richard Weinberger 2015-03-23 22:54:02 UTC
(In reply to Eugene Bujak from comment #29)
> Oh, I see. I'm out of ideas, then.
> 
> The workaround is definitely not production quality and would raise more
> questions from anyone who's not familiar with the issue. I'll have to keep
> it with me until the host kernel is updated in my own case. 
> 
> Is there some way to let other people know about this when they stumble into
> same situation while using UML?

UML has a nice mailinglist.
But I don't expect many people to hit that issue as you need a) a broken stable kernel and b) JFS.
JFS has not that much users these days.

> I'm still uncertain about the need to call telldir()/seekdir() on every
> entry, but changing that would be hiding the issue into even more rare
> situation rather than fixing it, since nothing is stopping VFS from calling
> with problematic directory entry as a parameter.

I'm sure we can optimize the hostfs readdir code. All we have to make sure it that the VFS directory cursor is honoured.
Patches are welcome. :-)

> Either way, thank you for looking into this bug report. I honestly expected
> it to be completely ignored and fall into oblivion when I was filing it.

Thanks for reporting.
Comment 31 Richard Weinberger 2015-03-24 14:51:32 UTC
Created attachment 171991 [details]
reworked hostfs readdir

Can you please give the following patch a try?
It reduces the number of system calls in hostfs' readdir and as a side effect
UML should no longer trip over the JFS issue.
Comment 32 Eugene Bujak 2015-03-24 17:27:49 UTC
Checked it, ls -l and mc work fine.
Comment 33 Richard Weinberger 2015-03-24 23:21:58 UTC
(In reply to Eugene Bujak from comment #32)
> Checked it, ls -l and mc work fine.

Okay. I'll queue this patch for v4.1.

Note You need to log in before you can comment on or make changes to this bug.