Bug 203973

Summary: VFAT filesystem causes errors when opened through Calibre and disconnects device
Product: File System Reporter: Chris Braun (cryzed)
Component: FAT/VFAT/MSDOSAssignee: OGAWA Hirofumi (hirofumi)
Status: CLOSED WILL_NOT_FIX    
Severity: high CC: ametzler, andrew.david.strong, bbbush, cryzed, kovid, muzcuk, zzam
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: linux-lts 4.19.51 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: strace -f -o log /usr/bin/calibre on linux-lts 4.19.51
dmesg output (with UFW spam removed)
sudo strace -p 991 -o log (running /usr/lib/udisks2/udisksd daemon)
strace -v -s 100000 -f -p <udisk pid> -o log.txt
revert changes to fs/fat/file.c to state of linux-lts 4.19.50
debug patch for no medium error
add nobarrier option

Description Chris Braun 2019-06-24 12:38:48 UTC
This happens with two different Amazon Kindle models (Voyage and Oasis 2 on firemware v5.11.2). Just mounting the Kindle filesystem regularly seems to work, however when the e-book managing software Calibre (v3.44.0) is used to mount or open (preview) the device, the device disconnects immediately with the following output in dmesg:

[ 3482.081778] VFS: busy inodes on changed media or resized disk sdf
[ 3482.159247] print_req_error: I/O error, dev sdf, sector 1
[ 3482.159264] FAT-fs (sdf): bread failed in fat_clusters_flush
[ 3482.159364] print_req_error: I/O error, dev sdf, sector 0
[ 3482.159391] FAT-fs (sdf): unable to read boot sector to mark fs as dirty

This doesn't happen with linux-lts 4.19.50. I'm not sure if Calibre is at fault or if there's an issue with the changes introduced in linux-lts 4.19.51. I am running Arch Linux.
Comment 1 Chris Braun 2019-06-24 12:44:42 UTC
After checking the changelog here: https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.19.51 the change seems to be commit 5b6619b4d206662f0f6df20fa660f468f56bb75d.
Comment 2 OGAWA Hirofumi 2019-06-24 13:12:44 UTC
[ 3482.081778] VFS: busy inodes on changed media or resized disk sdf

This is usually error of changed device while fs is busy.
Well, can you check if this issue is fixed by using 4.19.50 or not?

And I don't know what calibre does for fatfs mount, can you provide
"strace -f -o log" to see what calibre does?
Comment 3 Chris Braun 2019-06-24 13:17:07 UTC
The bug is fixed by using 4.19.50 -- sorry, I thought I made that clear. I downgraded the kernel repeatedly until it worked properly again, and that release was the first one. I will upgrade to 4.19.51 temporarily and do what you suggested, the logs will follow soon.
Comment 4 Chris Braun 2019-06-24 13:27:36 UTC
Created attachment 283413 [details]
strace -f -o log /usr/bin/calibre on linux-lts 4.19.51

I had to compress the log, since unpacked it is ~95mb.
Comment 5 OGAWA Hirofumi 2019-06-24 15:34:39 UTC
While running strace, the issue was reproduced? calibre looks like doesn't
mount the device, at least, directly.

Can you show full dmesg log when the issue reproduced?
Comment 6 Chris Braun 2019-06-24 15:50:16 UTC
Created attachment 283417 [details]
dmesg output (with UFW spam removed)

Yes, while running strace the issue was reproduced. I then immediately afterwards Ctrl+C strace and the program closed. I'll attach the dmesg output with the issue reproduced on linux-lts 4.19.51.

To clarify: I can mount the device regularly, without issues (through my filemanager or manually). However, if I open Calibre, after it has already been mounted, Calibre does something with the device that causes this error. If I don't mount the device before I open Calibre, Calibre first mounts the device and then causes the error.
Comment 7 OGAWA Hirofumi 2019-06-25 00:57:12 UTC
Thanks.

Calibre looks like asking to do mount to UDisk2. So, can you strace udisk processes too?

Well, looks like this is not fatfs issue. To test, it might be good to
revert fatfs part that changed in 4.19.51.
Comment 8 Chris Braun 2019-06-25 10:02:52 UTC
I'm not sure how to go about stracing the udisk processes, I figured this would be started by Calibre and then included in strace. Currently I'm running 4.19.50, so there's no need to revert that part and recompile the kernel (for me). I'm not sure how to best help debug this issue otherwise.
Comment 9 Chris Braun 2019-06-25 10:04:25 UTC
Nevermind, I understood you now, I am supposed to revert back to 4.19.51 with the changed fatfs code and run strace on any udisk processes I find. I also just found the -p (PID) switch, so I'll try this.
Comment 10 Chris Braun 2019-06-25 10:11:44 UTC
Created attachment 283425 [details]
sudo strace -p 991 -o log (running /usr/lib/udisks2/udisksd daemon)
Comment 11 Chris Braun 2019-06-25 10:13:21 UTC
Comment on attachment 283425 [details]
sudo strace -p 991 -o log (running /usr/lib/udisks2/udisksd daemon)

This is the strace from the 4.19.51. I attached the Kindle to the computer, did not mount it beforehand, strace'd the running UDisk2 daemon and then started Calibre. I waited until the bug occurs (the error shows up in dmesg and the device disconnects unintentionally) and then quit strace.
Comment 12 OGAWA Hirofumi 2019-06-25 10:38:57 UTC
Oops, sorry. I should mention strace options too. Well, strace says about /dev/sdf though. It forks child threads.

So can you strace child threads too?

# strace -f -o log -p <udisk pid>

Thanks.
Comment 13 OGAWA Hirofumi 2019-06-25 10:47:42 UTC
please use the following instead.

# strace -v -s 100000 -f -p <udisk pid>
Comment 14 Chris Braun 2019-06-25 10:58:22 UTC
Created attachment 283433 [details]
strace -v -s 100000 -f -p <udisk pid> -o log.txt

Same deal as before, attached Kindle, started strace, waited for error to occur and then Ctrl+C.

By the way, once we hopefully figure out what's causing this would it be possible to delete (all) my attachments from this issue? There's quite a bit of information in them that I don't want to be public.
Comment 15 OGAWA Hirofumi 2019-06-25 14:46:04 UTC
  12023 mount("/dev/sdf", "/run/media/user/Kindle" ...
  ...
  991   recvmsg(... "ACTION=change\0DEVPATH=/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5:1.0/host6/target6:0:0/6:0:0:0/block/sdf\0SUBSYSTEM=block\0DISK_MEDIA_CHANGE=1\0DEVNAME=/dev/sdf ...)
  ...
  1015  openat(AT_FDCWD, "/sys/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5:1.0/host6/target6:0:0/6:0:0:0/block/sdf/size"...
  1015  read(16, "0\n", 4096)             = 2
  ...
  12038 umount2("/run/media/user/Kindle", MNT_DETACH <unfinished ...>


looks like mount is succeed, but got media change event somehow, and
device seems to be disconnected as you says. So umount the mount point.

This should be happening under vfat layer. Maybe, usb layer or such.

Can you revert fat changes in 4.19.51 and rebuild kernel, then test?
(copy fs/fat/file.c from 4.19.50 to 4.19.51)

> delete (all) my attachments from this issue?

Sorry, I don't know how to remove attachments. Please ask to bugzilla admin.
Comment 16 Chris Braun 2019-06-25 15:01:20 UTC
> Sorry, I don't know how to remove attachments. Please ask to bugzilla admin.
I contacted the Helpdesk's e-mail, since I don't know who is administrating the bugtracker.

> Can you revert fat changes in 4.19.51 and rebuild kernel, then test?
I'll do this, but it might take me a bit.
Comment 17 Konstantin Ryabitsev 2019-06-25 15:05:50 UTC
The content of attachment 283433 [details] has been deleted for the following reason:

Requested by the user
Comment 18 Konstantin Ryabitsev 2019-06-25 15:06:13 UTC
The content of attachment 283425 [details] has been deleted for the following reason:

Requested by the user
Comment 19 Konstantin Ryabitsev 2019-06-25 15:06:27 UTC
The content of attachment 283417 [details] has been deleted for the following reason:

Requested by the user
Comment 20 Konstantin Ryabitsev 2019-06-25 15:06:45 UTC
The content of attachment 283413 [details] has been deleted for the following reason:

Requested by the user
Comment 21 Chris Braun 2019-06-25 17:42:41 UTC
Created attachment 283437 [details]
revert changes to fs/fat/file.c to state of linux-lts 4.19.50

As requested, I built the 4.19.51 kernel with the changes made to fs/fat/file.c reverted to the state of the 4.19.50 release: I can confirm this fixes the issue. It's definitely those changes that lead to the unexpected behavior.
Comment 22 OGAWA Hirofumi 2019-06-26 00:25:53 UTC
Created attachment 283439 [details]
debug patch for no medium error

Hm, interesting. That just changes the order of functions.

Can you re-test with applying this debug patch, and send dmesg after test?
If you don't want to public the log, it's ok to send email to me directly.

hirofumi@mail.parknet.co.jp
Comment 23 OGAWA Hirofumi 2019-06-26 00:27:07 UTC
BTW, debug patch is for vanila-4.19.51.
Comment 24 Chris Braun 2019-06-26 12:49:26 UTC
I'm compiling it right now and will report back in a bit.
Comment 25 Chris Braun 2019-06-26 13:14:19 UTC
I sent you the log via e-mail.
Comment 26 OGAWA Hirofumi 2019-06-26 14:05:17 UTC
[   47.271876] fat_file_fsync:204: path <filename>
[   47.272634] fat_file_fsync:207: err 0
[   47.273305] fat_file_fsync:212: err 0
[   47.281683] fat_file_fsync:217: err 0
...
[   48.232901] fat_file_fsync:204: path <filename2>
[   48.265533] fat_file_fsync:207: err 0
[   48.266637] fat_file_fsync:212: err 0
[   48.491774] fat_file_fsync:217: err 0
...
[   52.310349] media_not_present:1531: sense 2, asc

fatfs operations succeed without any error. And after 4secs of that fsync path,
The kindle returns NOT_READY (MEDIUM NOT PRESENT) to TEST UNIT READY command.
This kindle behavior is root cause of this issue.

However I have no idea, why kindle returns this, and why fatfs changes affect
to this.

It might be better to ask usb guys too, with usbmon trace.
Comment 27 OGAWA Hirofumi 2019-06-26 14:08:06 UTC
Can you get usbmon data for kindle access? (if you don't know about
usbmon, please ask. BTW, CONFIG_USB_MON=y or m is required)
Comment 28 Chris Braun 2019-06-26 15:09:36 UTC
I sent you the usbmon data per e-mail. I triggered the error while monitoring it.
Comment 29 OGAWA Hirofumi 2019-06-26 16:20:08 UTC
Ouch, I should mentioned about tcpdump for usbmon. Please take usbmon data by
pcap format. With pcap format, wireshark can decode usb packet data.
Comment 30 OGAWA Hirofumi 2019-06-26 16:30:16 UTC
For example,

    # tcpdump -i usbmon0 -w usbmon.pcap

("usbmon0" might have to adjust for your environment.)
Comment 31 Chris Braun 2019-06-26 19:12:29 UTC
The file should have arrived per e-mail.
Comment 32 OGAWA Hirofumi 2019-06-27 10:49:27 UTC
 4203   9.160230         host → 1.25.1       USBMS 95 SCSI: Synchronize Cache(10) LUN: 0x00 (LBA: 0x00000000, Len: 0)
 4206   9.164911       1.25.1 → host         USBMS 77 SCSI: Response LUN: 0x00 (Synchronize Cache(10)) (Good)
[...]
 7323  10.202167         host -> 1.25.1       USBMS 95 SCSI: Synchronize Cache(10) LUN: 0x00 (LBA: 0x00000000, Len: 0)
 7326  10.432266       1.25.1 -> host         USBMS 77 SCSI: Response LUN: 0x00 (Synchronize Cache(10)) (Good)
 7327  10.769092         host -> 1.25.1       USBMS 95 SCSI: Test Unit Ready LUN: 0x00 
 7330  10.769192       1.25.1 -> host         USBMS 77 SCSI: Response LUN: 0x00 (Test Unit Ready) (Good)
 7335  12.849093         host -> 1.25.1       USBMS 95 SCSI: Test Unit Ready LUN: 0x00 
 7338  12.849206       1.25.1 -> host         USBMS 77 SCSI: Response LUN: 0x00 (Test Unit Ready) (Check Condition)
 7339  12.849209         host -> 1.25.1       USBMS 95 SCSI: Request Sense LUN: 0x00 

This is the simplified usbmon trace. First flush succeed and second
flush too, then the followed TUR is succeed too. Until this point, all
is normal. 

But after 2 secs, second TUR returns error, and issues auto-SENSE to
check details (SENSE returns NOT_READY + MEDIUM NOT PRESENT).
[I have no idea why 2secs delay returns error from the device.]

TUR may be issued by kernel periodical checker. Can you check output of
the following command?

    # cat /sys/module/block/parameters/events_dfl_poll_msecs

And then, this command will disable TUR checker. If run this command before running calibre, does this change the behavior?

    # echo 0 > /sys/block/sdf/events_poll_msecs

Please get usbmon again with this config change.

[
    # echo -1 > /sys/block/sdf/events_poll_msecs

 will revert the above change]
Comment 33 Chris Braun 2019-06-27 13:21:31 UTC
[root@host user]# cat /sys/module/block/parameters/events_dfl_poll_msecs
2000

The new usbmon capture should have arrived per e-mail. Setting events_poll_msecs=0 using the command does not change the behavior.
Comment 34 OGAWA Hirofumi 2019-06-27 13:32:33 UTC
There is no TEST UNIT READY command as expected, and there is no any commands
to kindle in usbmon capture.

[ 3482.081778] VFS: busy inodes on changed media or resized disk sdf
[ 3482.159247] print_req_error: I/O error, dev sdf, sector 1
[ 3482.159264] FAT-fs (sdf): bread failed in fat_clusters_flush
[ 3482.159364] print_req_error: I/O error, dev sdf, sector 0
[ 3482.159391] FAT-fs (sdf): unable to read boot sector to mark fs as dirty

After events_poll_msecs=0, this message really outputed after ran calibre?
Comment 35 Chris Braun 2019-06-27 15:52:44 UTC
I'll try again.
Comment 36 Chris Braun 2019-06-27 15:58:02 UTC
Please check the latest e-mail, containing "dmesg.txt" and "usbmon1.pcap". I am pretty sure I did it correctly, yes.
Comment 37 OGAWA Hirofumi 2019-06-27 16:41:17 UTC
Created attachment 283463 [details]
add nobarrier option

OK, thank you.

I confirmed the following line was gone,

    VFS: busy inodes on changed media or resized disk sdf

but I/O still becomes error.

FLUSH command may affect to kindle somehow. So, attached patch adds "nobarrier"
mount option to fatfs.

Can you run same test with "mount -o nobarrier" mount, and see result?
Comment 38 Chris Braun 2019-06-27 17:02:12 UTC
I'm compiling the linux-lts 4.19.51 with fat-debug.patch and fat-nobarrier.patch and then I'll test this, however this might take me a while.
Comment 39 Chris Braun 2019-06-27 17:49:44 UTC
I had to add "/dev/sdf /run/media/user/Kindle vfat defaults,nobarrier 0 2" to my fstab, so that Calibre would mount it through udisk with the correct options. I confirmed that this worked by checking "mount" and seeing this line: "/dev/sdf on /run/media/user/Kindle type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso88591,shortname=mixed,utf8,errors=remount-ro,nobarrier)". However now, Calibre won't recognize the device and "open" it -- the error won't even occur now.
Comment 40 OGAWA Hirofumi 2019-06-28 01:04:22 UTC
If added that line to /etc/fstab, udisks2 use mount options exactly that is
specified by /etc/fstab.

So, to allow normal user access, you would have to use something like,

    /dev/sdf /run/media/user/Kindle vfat defaults,user,nobarrier,showexec,utf8,flush,nobarrier
Comment 41 OGAWA Hirofumi 2019-06-28 01:11:52 UTC
maybe didn't work as intended. probably needs uid=,gid=

    /dev/sdf /run/media/user/Kindle vfat defaults,user,nobarrier,showexec,utf8,flush,nobarrier,uid=<uid name>,gid=<gid name>
Comment 42 OGAWA Hirofumi 2019-06-28 01:25:31 UTC
read a source of udisks2 a bit, x-udisks-auth may work for this purpose

/dev/sdf /run/media/user/Kindle vfat defaults,user,nobarrier,showexec,utf8,flush,nobarrier,x-udisks-auth
Comment 43 Chris Braun 2019-06-28 10:17:16 UTC
Ah sorry about that! I thought the requested options and those in /etc/fstab might be merged or something, or I might have simply forgotten about the access rights.

I can confirm that when adding "/dev/sdf /run/media/user/Kindle vfat defaults,nobarrier,uid=1000,gid=1000 0 2" to /etc/fstab and then opening Calibre (letting it mount and "open/preview" the device), the bug doesn't occur anymore due to the nobarrier option!

To make sure I also tested once with "/dev/sdf /run/media/user/Kindle vfat defaults,uid=1000,gid=1000 0 2" and the bug occurred again.

Do you still need the usbmon data? It seems you've figured out the cause at least.
Comment 44 OGAWA Hirofumi 2019-06-28 11:42:38 UTC
Thank you. No need usbmon anymore.

I think kindle suspend or disconnect if it got sync then there is idle time.
Just a guess though, kindle may be trying to detect safe-disconnect of windows
or such.

To workaround this and provide flexibility, I will add nobarrier option to fat.

To config this automatically, user space daemon/app has to notice and need to
change though.

If the above guess is true,

    $ sync /run/media/user/Kindle/driveinfo.calibre

then wait for a while, kindle disconnect? (further access will become I/O error)
Comment 45 Chris Braun 2019-06-28 11:47:25 UTC
> then wait for a while, kindle disconnect? (further access will become I/O
> error)

Yes, that's correct, the Kindle disconnects without the "nobarrier" option if I do this.

> To config this automatically, user space daemon/app has to notice and need to
> change though.

So you're saying UDisk will see that the device is a Kindle device, and then automatically mount it with the nobarrier option? That seems sensible, because as it stands, I won't be the only person having this issue. I'm honestly quite puzzled why I am (seemingly) the only one reporting it.

Any ideas what the next steps are? Are you going to notify the appropriate developers about this change? I'm not really sure who to talk to.
Comment 46 OGAWA Hirofumi 2019-06-28 11:56:31 UTC
> So you're saying UDisk will see that the device is a Kindle device, and then
> automatically mount it with the nobarrier option? That seems sensible,
> because > as it stands, I won't be the only person having this issue. I'm
> honestly quite > puzzled why I am (seemingly) the only one reporting it.

Right, I'm not thinking deeply what is best way though. 
It might be good to tell udisks developer to allow "nobarrier" option in
whitelist (whitelist is hardcoded in udisks :/).

If "nobarrier" is in whiltelist, calibre itself can control "nobarrier" option
within udisks request packet. (Or udisks can add always)

Please contact what developers you need, I have no connection to those 
developers too.


I myself going to submit this nobarrier option patch to kernel, and backport.
Comment 47 Chris Braun 2019-06-28 12:48:37 UTC
Thanks for all your patience and effort!
Comment 48 OGAWA Hirofumi 2019-06-28 17:25:08 UTC
While discussing other kernel developer, found another options

   echo "write through" > /sys/block/<dev>/queue/write_cache

or

   echo "temporary write through" > /sys/block/<dev>/device/scsi_disk/<device path>/cache_type

(<dev> and <device path> has to change for target device)

This options might be better. All disables flush command.
Comment 49 Chris Braun 2019-06-29 16:30:09 UTC
This still feels strange in retrospect. Either this is Kindle-specific and the Kindle is at fault -- and that's fine then. Otherwise, if this applies to all VFAT mounted filesystems now, (file activity is required after a fsync otherwise the device is ejected/filesystem unmounted) then this is clearly a regression in the Kernel and this manual workaround with disabling fsync is not the ideal solution. What about reverting the change to the VFAT code in the kernel?
Comment 50 OGAWA Hirofumi 2019-06-30 02:47:40 UTC
I don't know if kindle specific or not though, this is first one that reported.
Of course, other devices I know work fine with change or without change.

And unfortunately, reverting change doesn't fix kindle fully.

Old is,

    write data/inode
    flush
    write fat area

New is

    write data/inode
    write fat area
    flush

New one is right way for other devices. And even if reverted change,

    $ touch /run/media/user/Kindle/driveinfo.calibre
    $ sync /run/media/user/Kindle/driveinfo.calibre

with above pattern, kindle will also disconnect. ("touch" doesn't dirty
the fat area.)

[BTW, nobarrier doesn't disable fsync fully, disable only flush.]
Comment 51 Kovid Goyal 2019-07-04 05:07:26 UTC
touch is not sufficient. In calibre I worked around this by creating a dummy, file, writing a few bytes to it and deleting the file after every fsync.

The Kindle is a pretty widespread device. If it really does disconnect on seeing no activity after a flush, and this change remains in the kernel, it's going to cause a lot of breakage. 

To me the obvious solution is to follow the FLUSH with some harmless keep alive type fat area activity, a simple read might be sufficient, though  I have no tested it.

As far as calibre is concerned, I have already worked around it, but, there needs to be a proper fix, either in the kernel or the kindle firmware, which also uses linux, IIRC.

And just to simplify testing exactly what activity works to prevent the eject, here is a one-liner

python -c "p = '/run/media/kovid/Kindle/driveinfo.calibre'; f = open(p, 'r+b'); os.fsync(f.fileno());"

This will cause an eject. You can add whatever you like after the call to the fsync to figure out what minimal activity the Kindle requires after flush. Replace the path to the moutpoint of the kindle as appropriate for your system.
Comment 52 Kovid Goyal 2019-07-04 05:21:23 UTC
yes, as I thought, a read is sufficient. So the following will cause an eject:

python -c "import os; f = open('/run/media/kovid/Kindle/driveinfo.calibre', 'r+b'); os.fsync(f)"

but the following will not:

python -c "import os; f = open('/run/media/kovid/Kindle/driveinfo.calibre', 'r+b'); os.fsync(f); f.read(1)"

Sadly, doing a stat on the file is not sufficient, since I guess it does not involve the fat area
Comment 53 Kovid Goyal 2019-07-04 05:29:11 UTC
utime() also seems to work to prevent ejects.
Comment 54 OGAWA Hirofumi 2019-07-04 05:47:26 UTC
> touch is not sufficient. In calibre I worked around this by creating a dummy,
> file, writing a few bytes to it and deleting the file after every fsync.

touch is example of how to reproduce disconnect. Not to prevent.

> The Kindle is a pretty widespread device. If it really does disconnect on
> seeing no activity after a flush, and this change remains in the kernel, it's
> going to cause a lot of breakage. 

Again, as I said, this happens with or without that change.

And your workaround is depending on cached or not cached.
If reproduced even if flush was disabled, let me know.
Comment 55 Kovid Goyal 2019-07-04 06:22:02 UTC
I think it is pretty clear that the Kindle does not disconnect without this change. I have been connecting Kindles to Linux computers without disconnects for over a decade. 

It is also pretty clear that, the kernel producing a FLUSH followed by no activity is going to cause kindles to disconnect. The obvious workaround is to produce some simple activity after the FLUSH. This should ideally be done by the kernel, with some testing to find out what the minimal required activity is. I posted an example with f.read() to show that even a read is sufficient, that is not the actual workaround I used. The actual workaround is to write things to the filesystem, so there is no question of caching. 

Anyway, it's your choice, you can either fix this properly now, before the problem becomes more widespread or deal with all the ensuing bug reports in the future, once kernels with this patch become more widely deployed. Expecting users to turn on device specific mount flags in fstab or every software that writes to such devices to somehow either mount or remount the devices with a device specific flag, is not a realistic solution. Nor is disabling FLUSH a good idea, it breaks the semantics of fsync.

In any case I have worked around this bug in calibre, so my interest in this issue is minimal at best. Feel free to do whatever you like about it.
Comment 56 OGAWA Hirofumi 2019-07-04 06:49:57 UTC
You are not understanding this issue correctly.

fsync is broken without that change, the change is to fix of fsync breakage.
And needless write make reduce liftime of flash device.
Comment 57 Kovid Goyal 2019-07-04 08:29:45 UTC
You are saying that to fix some theoretical fsync breakage that has presumably existed for decades you are willing to have fsync cause devices to eject!!! Do you think that fsync causing a device to eject is not broken? I wonder how every other OS on the planet manages to implement fsync without causing devices to eject.

Yes, needless writes are unfortunate, but thanks to the breakage you have introduced, they are unavoidable on Linux. Everybody that wants to use fsync on ejectable devices now has to follow it up with extra writes to workaround this breakage. Because widely used devices assume that a fsync with no following activity is a reason to eject. I totally understand that this is a bug in the device, but nonetheless, this is a widely used device that has worked well with Linux for a decade, breaking that now, especially when the workaround is so simple, is really unfortunate.
Comment 58 OGAWA Hirofumi 2019-07-04 09:17:27 UTC
> Nor is disabling FLUSH a good idea, it breaks the semantics of fsync.

and now, you say

> You are saying that to fix some theoretical fsync breakage that has
> presumably > existed for decades you are willing to have fsync cause devices
> to eject!!!

If you don't care fsync flush semantics, you can just disable flush to
workaround kindle behavior.

> I wonder how every other OS on the planet manages to implement fsync without
> causing devices to eject.

True. If you can provide usbmon on other OS, it would be helpful info.

> Yes, needless writes are unfortunate, but thanks to the breakage you have
> introduced,

As said, you can reproduce this issue without that change.
Comment 59 Kovid Goyal 2019-07-04 09:24:44 UTC
As I have already demonstrated ALL you need to do to fix this issue is do a READ on the device after every FLUSH. Just make sure it bypasses the cache, which you should be able to do since the cache is maintained by the kernel. 

In fact, probably a READ is not really necessary, simply making some kind of query to the device will be sufficient, but a READ is simple and side effect free.
Comment 60 OGAWA Hirofumi 2019-07-04 09:41:38 UTC
Can't add penalty to all devices for one device is having strange behavior.

If you prefer to use read command, you can do it with direct I/O without
kernel help, and maybe possible to carefully use flush command too.

> In fact, probably a READ is not really necessary

At least, TUR doesn't work.
Comment 61 Kovid Goyal 2019-07-04 09:53:12 UTC
No, I am fine with using utime(). calibre only uses fsync() after writing a full file, so the extra writes caused by utime() are negligible in comparison. So, for calibre at least utime() is a reasonable workaround. My posts were not about calibre, but about the general problems that will be caused if such a widely used device suddenly start disconnecting seemingly randomly from linux machines. But what you want to do about that is your call.
Comment 62 OGAWA Hirofumi 2019-07-04 10:02:13 UTC
> My posts were not about calibre, but about the general problems that will be
> caused if such a widely used device suddenly start disconnecting seemingly
> randomly from linux machines.

Right, it is why trying to provide the option to disable flush command as
relatively flexible workaround with above patch.
Comment 63 Kovid Goyal 2019-07-04 10:11:19 UTC
Yeah but the nobarrier option is going to require users to know to set it. Or to have all software maintain a database of devices that require it (I highly doubt kindle is the only one). Neither option is really workable.

Not to mention that nobarrier basically breaks fsync completely, since now fsync no longer means data is actually written to the device. For example, calibre uses fsync so that if the user happens to pull the cable after transferring files to the device, without unmounting, the transferred files wont be corrupted. This simply will not work with nobarrier.

Similarly any database software that relies on the semantics of fsync will also not work with nobarrier, leading to data loss.
Comment 64 OGAWA Hirofumi 2019-07-04 10:27:49 UTC
True.

Reverting that change breaks fsync flush semantics like you care now.
Correct semantics by that change hitted the behavior of kindle disconnect.

Whether there is real issue or not, is depending on in-device cache handling
when lost power though (supercapacitor can help it).

The "nobarrier" option is similar behavior like before that change. Make sure
there is no disconnect, but fsync's flush command is broken.
Comment 65 Kovid Goyal 2019-07-04 10:32:26 UTC
I'm not suggesting you revert the change. I'm saying figure out what a minimal USB keep-alive command is that prevents the kindle from disconnecting and just send it after FLUSH. It's fairly obvious that all other OSes do this, otherwise the Kindle would never have been designed with this behavior.

I am sympathetic with the desire to not penalize other devices for the kindle, but given that all other OSes do this there must be some USB command that has extremely low overhead that is sent after FLUSH by other OSes. If Linux did the same, I think that is the best solution.
Comment 66 OGAWA Hirofumi 2019-07-04 10:44:51 UTC
It's why I asked to get usbmon by other OS what is doing.

And if it is not fat layer stuff, I can't decide anymore, would have to report
and ask to usb developer or such.
Comment 67 OGAWA Hirofumi 2019-09-27 19:38:34 UTC
Unfortunately, nobarrier patch was rejected.

So, the workaround would be

   echo "write through" > /sys/block/<dev>/queue/write_cache

(<dev> has to change for target device)

This disables flush command, and udev rule would be able to do this automatically
Comment 68 Matthias Schwarzott 2021-03-14 13:14:22 UTC
As testing shows, enough TEST UNIT READY commands (after SYNCHRONIZE CACHE) make the Kindle not disconnect.

SO the easiest solution is to set events_poll_msecs low enough.
Testing showed that values up until 950 seem to work.

I suggest this udev rule and will submit it:
ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", \
  SUBSYSTEMS=="usb", ATTRS{idVendor}=="1949", ATTRS{idProduct}=="0004", \
  ATTR{events_poll_msecs}="800"
Comment 69 Matthias Schwarzott 2021-04-06 15:38:22 UTC
Meanwhile the issue has been fixed by commit 546aa0e4ea6ed81b6c51baeebc4364542fa3f3a7 (usb-storage: Add quirk to defeat Kindle's automatic unload),
see https://github.com/torvalds/linux/commit/546aa0e4ea6ed81b6c51baeebc4364542fa3f3a7

It basically creates a new quirk (applied to Kindle) that sends a TEST UNIT READY command directly after a SYNCHRONIZE CACHE command.

The fix will be in kernel 5.12.
It also has also been backported to some stable series:
* 4.14
* 4.19
* 5.4
* 5.10
* 5.11