This happens with two different Amazon Kindle models (Voyage and Oasis 2 on firemware v5.11.2). Just mounting the Kindle filesystem regularly seems to work, however when the e-book managing software Calibre (v3.44.0) is used to mount or open (preview) the device, the device disconnects immediately with the following output in dmesg: [ 3482.081778] VFS: busy inodes on changed media or resized disk sdf [ 3482.159247] print_req_error: I/O error, dev sdf, sector 1 [ 3482.159264] FAT-fs (sdf): bread failed in fat_clusters_flush [ 3482.159364] print_req_error: I/O error, dev sdf, sector 0 [ 3482.159391] FAT-fs (sdf): unable to read boot sector to mark fs as dirty This doesn't happen with linux-lts 4.19.50. I'm not sure if Calibre is at fault or if there's an issue with the changes introduced in linux-lts 4.19.51. I am running Arch Linux.
After checking the changelog here: https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.19.51 the change seems to be commit 5b6619b4d206662f0f6df20fa660f468f56bb75d.
[ 3482.081778] VFS: busy inodes on changed media or resized disk sdf This is usually error of changed device while fs is busy. Well, can you check if this issue is fixed by using 4.19.50 or not? And I don't know what calibre does for fatfs mount, can you provide "strace -f -o log" to see what calibre does?
The bug is fixed by using 4.19.50 -- sorry, I thought I made that clear. I downgraded the kernel repeatedly until it worked properly again, and that release was the first one. I will upgrade to 4.19.51 temporarily and do what you suggested, the logs will follow soon.
Created attachment 283413 [details] strace -f -o log /usr/bin/calibre on linux-lts 4.19.51 I had to compress the log, since unpacked it is ~95mb.
While running strace, the issue was reproduced? calibre looks like doesn't mount the device, at least, directly. Can you show full dmesg log when the issue reproduced?
Created attachment 283417 [details] dmesg output (with UFW spam removed) Yes, while running strace the issue was reproduced. I then immediately afterwards Ctrl+C strace and the program closed. I'll attach the dmesg output with the issue reproduced on linux-lts 4.19.51. To clarify: I can mount the device regularly, without issues (through my filemanager or manually). However, if I open Calibre, after it has already been mounted, Calibre does something with the device that causes this error. If I don't mount the device before I open Calibre, Calibre first mounts the device and then causes the error.
Thanks. Calibre looks like asking to do mount to UDisk2. So, can you strace udisk processes too? Well, looks like this is not fatfs issue. To test, it might be good to revert fatfs part that changed in 4.19.51.
I'm not sure how to go about stracing the udisk processes, I figured this would be started by Calibre and then included in strace. Currently I'm running 4.19.50, so there's no need to revert that part and recompile the kernel (for me). I'm not sure how to best help debug this issue otherwise.
Nevermind, I understood you now, I am supposed to revert back to 4.19.51 with the changed fatfs code and run strace on any udisk processes I find. I also just found the -p (PID) switch, so I'll try this.
Created attachment 283425 [details] sudo strace -p 991 -o log (running /usr/lib/udisks2/udisksd daemon)
Comment on attachment 283425 [details] sudo strace -p 991 -o log (running /usr/lib/udisks2/udisksd daemon) This is the strace from the 4.19.51. I attached the Kindle to the computer, did not mount it beforehand, strace'd the running UDisk2 daemon and then started Calibre. I waited until the bug occurs (the error shows up in dmesg and the device disconnects unintentionally) and then quit strace.
Oops, sorry. I should mention strace options too. Well, strace says about /dev/sdf though. It forks child threads. So can you strace child threads too? # strace -f -o log -p <udisk pid> Thanks.
please use the following instead. # strace -v -s 100000 -f -p <udisk pid>
Created attachment 283433 [details] strace -v -s 100000 -f -p <udisk pid> -o log.txt Same deal as before, attached Kindle, started strace, waited for error to occur and then Ctrl+C. By the way, once we hopefully figure out what's causing this would it be possible to delete (all) my attachments from this issue? There's quite a bit of information in them that I don't want to be public.
12023 mount("/dev/sdf", "/run/media/user/Kindle" ... ... 991 recvmsg(... "ACTION=change\0DEVPATH=/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5:1.0/host6/target6:0:0/6:0:0:0/block/sdf\0SUBSYSTEM=block\0DISK_MEDIA_CHANGE=1\0DEVNAME=/dev/sdf ...) ... 1015 openat(AT_FDCWD, "/sys/devices/pci0000:00/0000:00:14.0/usb1/1-5/1-5:1.0/host6/target6:0:0/6:0:0:0/block/sdf/size"... 1015 read(16, "0\n", 4096) = 2 ... 12038 umount2("/run/media/user/Kindle", MNT_DETACH <unfinished ...> looks like mount is succeed, but got media change event somehow, and device seems to be disconnected as you says. So umount the mount point. This should be happening under vfat layer. Maybe, usb layer or such. Can you revert fat changes in 4.19.51 and rebuild kernel, then test? (copy fs/fat/file.c from 4.19.50 to 4.19.51) > delete (all) my attachments from this issue? Sorry, I don't know how to remove attachments. Please ask to bugzilla admin.
> Sorry, I don't know how to remove attachments. Please ask to bugzilla admin. I contacted the Helpdesk's e-mail, since I don't know who is administrating the bugtracker. > Can you revert fat changes in 4.19.51 and rebuild kernel, then test? I'll do this, but it might take me a bit.
The content of attachment 283433 [details] has been deleted for the following reason: Requested by the user
The content of attachment 283425 [details] has been deleted for the following reason: Requested by the user
The content of attachment 283417 [details] has been deleted for the following reason: Requested by the user
The content of attachment 283413 [details] has been deleted for the following reason: Requested by the user
Created attachment 283437 [details] revert changes to fs/fat/file.c to state of linux-lts 4.19.50 As requested, I built the 4.19.51 kernel with the changes made to fs/fat/file.c reverted to the state of the 4.19.50 release: I can confirm this fixes the issue. It's definitely those changes that lead to the unexpected behavior.
Created attachment 283439 [details] debug patch for no medium error Hm, interesting. That just changes the order of functions. Can you re-test with applying this debug patch, and send dmesg after test? If you don't want to public the log, it's ok to send email to me directly. hirofumi@mail.parknet.co.jp
BTW, debug patch is for vanila-4.19.51.
I'm compiling it right now and will report back in a bit.
I sent you the log via e-mail.
[ 47.271876] fat_file_fsync:204: path <filename> [ 47.272634] fat_file_fsync:207: err 0 [ 47.273305] fat_file_fsync:212: err 0 [ 47.281683] fat_file_fsync:217: err 0 ... [ 48.232901] fat_file_fsync:204: path <filename2> [ 48.265533] fat_file_fsync:207: err 0 [ 48.266637] fat_file_fsync:212: err 0 [ 48.491774] fat_file_fsync:217: err 0 ... [ 52.310349] media_not_present:1531: sense 2, asc fatfs operations succeed without any error. And after 4secs of that fsync path, The kindle returns NOT_READY (MEDIUM NOT PRESENT) to TEST UNIT READY command. This kindle behavior is root cause of this issue. However I have no idea, why kindle returns this, and why fatfs changes affect to this. It might be better to ask usb guys too, with usbmon trace.
Can you get usbmon data for kindle access? (if you don't know about usbmon, please ask. BTW, CONFIG_USB_MON=y or m is required)
I sent you the usbmon data per e-mail. I triggered the error while monitoring it.
Ouch, I should mentioned about tcpdump for usbmon. Please take usbmon data by pcap format. With pcap format, wireshark can decode usb packet data.
For example, # tcpdump -i usbmon0 -w usbmon.pcap ("usbmon0" might have to adjust for your environment.)
The file should have arrived per e-mail.
4203 9.160230 host → 1.25.1 USBMS 95 SCSI: Synchronize Cache(10) LUN: 0x00 (LBA: 0x00000000, Len: 0) 4206 9.164911 1.25.1 → host USBMS 77 SCSI: Response LUN: 0x00 (Synchronize Cache(10)) (Good) [...] 7323 10.202167 host -> 1.25.1 USBMS 95 SCSI: Synchronize Cache(10) LUN: 0x00 (LBA: 0x00000000, Len: 0) 7326 10.432266 1.25.1 -> host USBMS 77 SCSI: Response LUN: 0x00 (Synchronize Cache(10)) (Good) 7327 10.769092 host -> 1.25.1 USBMS 95 SCSI: Test Unit Ready LUN: 0x00 7330 10.769192 1.25.1 -> host USBMS 77 SCSI: Response LUN: 0x00 (Test Unit Ready) (Good) 7335 12.849093 host -> 1.25.1 USBMS 95 SCSI: Test Unit Ready LUN: 0x00 7338 12.849206 1.25.1 -> host USBMS 77 SCSI: Response LUN: 0x00 (Test Unit Ready) (Check Condition) 7339 12.849209 host -> 1.25.1 USBMS 95 SCSI: Request Sense LUN: 0x00 This is the simplified usbmon trace. First flush succeed and second flush too, then the followed TUR is succeed too. Until this point, all is normal. But after 2 secs, second TUR returns error, and issues auto-SENSE to check details (SENSE returns NOT_READY + MEDIUM NOT PRESENT). [I have no idea why 2secs delay returns error from the device.] TUR may be issued by kernel periodical checker. Can you check output of the following command? # cat /sys/module/block/parameters/events_dfl_poll_msecs And then, this command will disable TUR checker. If run this command before running calibre, does this change the behavior? # echo 0 > /sys/block/sdf/events_poll_msecs Please get usbmon again with this config change. [ # echo -1 > /sys/block/sdf/events_poll_msecs will revert the above change]
[root@host user]# cat /sys/module/block/parameters/events_dfl_poll_msecs 2000 The new usbmon capture should have arrived per e-mail. Setting events_poll_msecs=0 using the command does not change the behavior.
There is no TEST UNIT READY command as expected, and there is no any commands to kindle in usbmon capture. [ 3482.081778] VFS: busy inodes on changed media or resized disk sdf [ 3482.159247] print_req_error: I/O error, dev sdf, sector 1 [ 3482.159264] FAT-fs (sdf): bread failed in fat_clusters_flush [ 3482.159364] print_req_error: I/O error, dev sdf, sector 0 [ 3482.159391] FAT-fs (sdf): unable to read boot sector to mark fs as dirty After events_poll_msecs=0, this message really outputed after ran calibre?
I'll try again.
Please check the latest e-mail, containing "dmesg.txt" and "usbmon1.pcap". I am pretty sure I did it correctly, yes.
Created attachment 283463 [details] add nobarrier option OK, thank you. I confirmed the following line was gone, VFS: busy inodes on changed media or resized disk sdf but I/O still becomes error. FLUSH command may affect to kindle somehow. So, attached patch adds "nobarrier" mount option to fatfs. Can you run same test with "mount -o nobarrier" mount, and see result?
I'm compiling the linux-lts 4.19.51 with fat-debug.patch and fat-nobarrier.patch and then I'll test this, however this might take me a while.
I had to add "/dev/sdf /run/media/user/Kindle vfat defaults,nobarrier 0 2" to my fstab, so that Calibre would mount it through udisk with the correct options. I confirmed that this worked by checking "mount" and seeing this line: "/dev/sdf on /run/media/user/Kindle type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso88591,shortname=mixed,utf8,errors=remount-ro,nobarrier)". However now, Calibre won't recognize the device and "open" it -- the error won't even occur now.
If added that line to /etc/fstab, udisks2 use mount options exactly that is specified by /etc/fstab. So, to allow normal user access, you would have to use something like, /dev/sdf /run/media/user/Kindle vfat defaults,user,nobarrier,showexec,utf8,flush,nobarrier
maybe didn't work as intended. probably needs uid=,gid= /dev/sdf /run/media/user/Kindle vfat defaults,user,nobarrier,showexec,utf8,flush,nobarrier,uid=<uid name>,gid=<gid name>
read a source of udisks2 a bit, x-udisks-auth may work for this purpose /dev/sdf /run/media/user/Kindle vfat defaults,user,nobarrier,showexec,utf8,flush,nobarrier,x-udisks-auth
Ah sorry about that! I thought the requested options and those in /etc/fstab might be merged or something, or I might have simply forgotten about the access rights. I can confirm that when adding "/dev/sdf /run/media/user/Kindle vfat defaults,nobarrier,uid=1000,gid=1000 0 2" to /etc/fstab and then opening Calibre (letting it mount and "open/preview" the device), the bug doesn't occur anymore due to the nobarrier option! To make sure I also tested once with "/dev/sdf /run/media/user/Kindle vfat defaults,uid=1000,gid=1000 0 2" and the bug occurred again. Do you still need the usbmon data? It seems you've figured out the cause at least.
Thank you. No need usbmon anymore. I think kindle suspend or disconnect if it got sync then there is idle time. Just a guess though, kindle may be trying to detect safe-disconnect of windows or such. To workaround this and provide flexibility, I will add nobarrier option to fat. To config this automatically, user space daemon/app has to notice and need to change though. If the above guess is true, $ sync /run/media/user/Kindle/driveinfo.calibre then wait for a while, kindle disconnect? (further access will become I/O error)
> then wait for a while, kindle disconnect? (further access will become I/O > error) Yes, that's correct, the Kindle disconnects without the "nobarrier" option if I do this. > To config this automatically, user space daemon/app has to notice and need to > change though. So you're saying UDisk will see that the device is a Kindle device, and then automatically mount it with the nobarrier option? That seems sensible, because as it stands, I won't be the only person having this issue. I'm honestly quite puzzled why I am (seemingly) the only one reporting it. Any ideas what the next steps are? Are you going to notify the appropriate developers about this change? I'm not really sure who to talk to.
> So you're saying UDisk will see that the device is a Kindle device, and then > automatically mount it with the nobarrier option? That seems sensible, > because > as it stands, I won't be the only person having this issue. I'm > honestly quite > puzzled why I am (seemingly) the only one reporting it. Right, I'm not thinking deeply what is best way though. It might be good to tell udisks developer to allow "nobarrier" option in whitelist (whitelist is hardcoded in udisks :/). If "nobarrier" is in whiltelist, calibre itself can control "nobarrier" option within udisks request packet. (Or udisks can add always) Please contact what developers you need, I have no connection to those developers too. I myself going to submit this nobarrier option patch to kernel, and backport.
Thanks for all your patience and effort!
While discussing other kernel developer, found another options echo "write through" > /sys/block/<dev>/queue/write_cache or echo "temporary write through" > /sys/block/<dev>/device/scsi_disk/<device path>/cache_type (<dev> and <device path> has to change for target device) This options might be better. All disables flush command.
This still feels strange in retrospect. Either this is Kindle-specific and the Kindle is at fault -- and that's fine then. Otherwise, if this applies to all VFAT mounted filesystems now, (file activity is required after a fsync otherwise the device is ejected/filesystem unmounted) then this is clearly a regression in the Kernel and this manual workaround with disabling fsync is not the ideal solution. What about reverting the change to the VFAT code in the kernel?
I don't know if kindle specific or not though, this is first one that reported. Of course, other devices I know work fine with change or without change. And unfortunately, reverting change doesn't fix kindle fully. Old is, write data/inode flush write fat area New is write data/inode write fat area flush New one is right way for other devices. And even if reverted change, $ touch /run/media/user/Kindle/driveinfo.calibre $ sync /run/media/user/Kindle/driveinfo.calibre with above pattern, kindle will also disconnect. ("touch" doesn't dirty the fat area.) [BTW, nobarrier doesn't disable fsync fully, disable only flush.]
touch is not sufficient. In calibre I worked around this by creating a dummy, file, writing a few bytes to it and deleting the file after every fsync. The Kindle is a pretty widespread device. If it really does disconnect on seeing no activity after a flush, and this change remains in the kernel, it's going to cause a lot of breakage. To me the obvious solution is to follow the FLUSH with some harmless keep alive type fat area activity, a simple read might be sufficient, though I have no tested it. As far as calibre is concerned, I have already worked around it, but, there needs to be a proper fix, either in the kernel or the kindle firmware, which also uses linux, IIRC. And just to simplify testing exactly what activity works to prevent the eject, here is a one-liner python -c "p = '/run/media/kovid/Kindle/driveinfo.calibre'; f = open(p, 'r+b'); os.fsync(f.fileno());" This will cause an eject. You can add whatever you like after the call to the fsync to figure out what minimal activity the Kindle requires after flush. Replace the path to the moutpoint of the kindle as appropriate for your system.
yes, as I thought, a read is sufficient. So the following will cause an eject: python -c "import os; f = open('/run/media/kovid/Kindle/driveinfo.calibre', 'r+b'); os.fsync(f)" but the following will not: python -c "import os; f = open('/run/media/kovid/Kindle/driveinfo.calibre', 'r+b'); os.fsync(f); f.read(1)" Sadly, doing a stat on the file is not sufficient, since I guess it does not involve the fat area
utime() also seems to work to prevent ejects.
> touch is not sufficient. In calibre I worked around this by creating a dummy, > file, writing a few bytes to it and deleting the file after every fsync. touch is example of how to reproduce disconnect. Not to prevent. > The Kindle is a pretty widespread device. If it really does disconnect on > seeing no activity after a flush, and this change remains in the kernel, it's > going to cause a lot of breakage. Again, as I said, this happens with or without that change. And your workaround is depending on cached or not cached. If reproduced even if flush was disabled, let me know.
I think it is pretty clear that the Kindle does not disconnect without this change. I have been connecting Kindles to Linux computers without disconnects for over a decade. It is also pretty clear that, the kernel producing a FLUSH followed by no activity is going to cause kindles to disconnect. The obvious workaround is to produce some simple activity after the FLUSH. This should ideally be done by the kernel, with some testing to find out what the minimal required activity is. I posted an example with f.read() to show that even a read is sufficient, that is not the actual workaround I used. The actual workaround is to write things to the filesystem, so there is no question of caching. Anyway, it's your choice, you can either fix this properly now, before the problem becomes more widespread or deal with all the ensuing bug reports in the future, once kernels with this patch become more widely deployed. Expecting users to turn on device specific mount flags in fstab or every software that writes to such devices to somehow either mount or remount the devices with a device specific flag, is not a realistic solution. Nor is disabling FLUSH a good idea, it breaks the semantics of fsync. In any case I have worked around this bug in calibre, so my interest in this issue is minimal at best. Feel free to do whatever you like about it.
You are not understanding this issue correctly. fsync is broken without that change, the change is to fix of fsync breakage. And needless write make reduce liftime of flash device.
You are saying that to fix some theoretical fsync breakage that has presumably existed for decades you are willing to have fsync cause devices to eject!!! Do you think that fsync causing a device to eject is not broken? I wonder how every other OS on the planet manages to implement fsync without causing devices to eject. Yes, needless writes are unfortunate, but thanks to the breakage you have introduced, they are unavoidable on Linux. Everybody that wants to use fsync on ejectable devices now has to follow it up with extra writes to workaround this breakage. Because widely used devices assume that a fsync with no following activity is a reason to eject. I totally understand that this is a bug in the device, but nonetheless, this is a widely used device that has worked well with Linux for a decade, breaking that now, especially when the workaround is so simple, is really unfortunate.
> Nor is disabling FLUSH a good idea, it breaks the semantics of fsync. and now, you say > You are saying that to fix some theoretical fsync breakage that has > presumably > existed for decades you are willing to have fsync cause devices > to eject!!! If you don't care fsync flush semantics, you can just disable flush to workaround kindle behavior. > I wonder how every other OS on the planet manages to implement fsync without > causing devices to eject. True. If you can provide usbmon on other OS, it would be helpful info. > Yes, needless writes are unfortunate, but thanks to the breakage you have > introduced, As said, you can reproduce this issue without that change.
As I have already demonstrated ALL you need to do to fix this issue is do a READ on the device after every FLUSH. Just make sure it bypasses the cache, which you should be able to do since the cache is maintained by the kernel. In fact, probably a READ is not really necessary, simply making some kind of query to the device will be sufficient, but a READ is simple and side effect free.
Can't add penalty to all devices for one device is having strange behavior. If you prefer to use read command, you can do it with direct I/O without kernel help, and maybe possible to carefully use flush command too. > In fact, probably a READ is not really necessary At least, TUR doesn't work.
No, I am fine with using utime(). calibre only uses fsync() after writing a full file, so the extra writes caused by utime() are negligible in comparison. So, for calibre at least utime() is a reasonable workaround. My posts were not about calibre, but about the general problems that will be caused if such a widely used device suddenly start disconnecting seemingly randomly from linux machines. But what you want to do about that is your call.
> My posts were not about calibre, but about the general problems that will be > caused if such a widely used device suddenly start disconnecting seemingly > randomly from linux machines. Right, it is why trying to provide the option to disable flush command as relatively flexible workaround with above patch.
Yeah but the nobarrier option is going to require users to know to set it. Or to have all software maintain a database of devices that require it (I highly doubt kindle is the only one). Neither option is really workable. Not to mention that nobarrier basically breaks fsync completely, since now fsync no longer means data is actually written to the device. For example, calibre uses fsync so that if the user happens to pull the cable after transferring files to the device, without unmounting, the transferred files wont be corrupted. This simply will not work with nobarrier. Similarly any database software that relies on the semantics of fsync will also not work with nobarrier, leading to data loss.
True. Reverting that change breaks fsync flush semantics like you care now. Correct semantics by that change hitted the behavior of kindle disconnect. Whether there is real issue or not, is depending on in-device cache handling when lost power though (supercapacitor can help it). The "nobarrier" option is similar behavior like before that change. Make sure there is no disconnect, but fsync's flush command is broken.
I'm not suggesting you revert the change. I'm saying figure out what a minimal USB keep-alive command is that prevents the kindle from disconnecting and just send it after FLUSH. It's fairly obvious that all other OSes do this, otherwise the Kindle would never have been designed with this behavior. I am sympathetic with the desire to not penalize other devices for the kindle, but given that all other OSes do this there must be some USB command that has extremely low overhead that is sent after FLUSH by other OSes. If Linux did the same, I think that is the best solution.
It's why I asked to get usbmon by other OS what is doing. And if it is not fat layer stuff, I can't decide anymore, would have to report and ask to usb developer or such.
Unfortunately, nobarrier patch was rejected. So, the workaround would be echo "write through" > /sys/block/<dev>/queue/write_cache (<dev> has to change for target device) This disables flush command, and udev rule would be able to do this automatically
As testing shows, enough TEST UNIT READY commands (after SYNCHRONIZE CACHE) make the Kindle not disconnect. SO the easiest solution is to set events_poll_msecs low enough. Testing showed that values up until 950 seem to work. I suggest this udev rule and will submit it: ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", \ SUBSYSTEMS=="usb", ATTRS{idVendor}=="1949", ATTRS{idProduct}=="0004", \ ATTR{events_poll_msecs}="800"
Meanwhile the issue has been fixed by commit 546aa0e4ea6ed81b6c51baeebc4364542fa3f3a7 (usb-storage: Add quirk to defeat Kindle's automatic unload), see https://github.com/torvalds/linux/commit/546aa0e4ea6ed81b6c51baeebc4364542fa3f3a7 It basically creates a new quirk (applied to Kindle) that sends a TEST UNIT READY command directly after a SYNCHRONIZE CACHE command. The fix will be in kernel 5.12. It also has also been backported to some stable series: * 4.14 * 4.19 * 5.4 * 5.10 * 5.11