Hi,

We are using a thin LV as a backing store for a MySQL database, in order to be able to provide writable, point-in-time clones of the database. Our layout is as follows (a sketch of how such a layout can be created follows at the end of this comment):

  LV                   VG    Attr       LSize   Pool      Origin            Data%  Move Log Copy%  Convert
  mysql                data  Vwi-aotz-   70.00g thinpool                    74.44
  mysql-1389783211     data  Vwi-a-tz-   70.00g thinpool  mysql             74.38
  mysql-1389783211.t7  data  Vwi-aotz-   70.00g thinpool  mysql-1389783211  74.38
  thinpool             data  twi-a-tz-  100.00g                             57.16

- mysql is the source LV, which is constantly written to by a MySQL instance running on the machine.
- mysql-<timestamp> are snapshots of "mysql", which are not exported/mounted anywhere and are used purely for rolling back the 2nd-level snapshots below.
- mysql-<timestamp>.<name> are snapshots of mysql-<timestamp>, exported to other machines via iSCSI.

We run v3.12.6 stable, which AFAICT from git has the same device-mapper code as master with regard to thin provisioning and space maps. We were hit by the following error twice within 3 days:

Jan 15 20:15:44 zarniwoop kernel: [101754.726823] device-mapper: space map common: unable to decrement a reference count below 0
Jan 15 20:15:44 zarniwoop kernel: [101754.735191] device-mapper: thin: 253:6: dm_thin_insert_block() failed: error = -22
Jan 15 20:15:44 zarniwoop kernel: [101754.742845] device-mapper: thin: 253:6: switching pool to read-only mode

As expected, all subsequent write operations on the pool fail. One thing we would like to stress is that we use ext4 with discards on these LVs, which might have something to do with the refcount issue:

/dev/mapper/data-mysql on /var/lib/mysql type ext4 (rw,relatime,discard,nobarrier,stripe=16,data=ordered)

Both syslog and the hardware report no other errors or suspicious messages at or around the error time, and the system has mild I/O activity.

Please let us know if you need more information or if you want us to test patches.

Regards,
Apollon
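For reference, a minimal sketch of how a layout like the above can be created with lvm2. The names are taken from the lvs output; the exact option syntax depends on the lvm2 version, so treat this as illustrative rather than the exact commands used:

  # Create the 100G thin pool and the 70G thin origin LV inside VG "data"
  lvcreate -L 100G -T data/thinpool
  lvcreate -V 70G -T data/thinpool -n mysql

  # First-level snapshot of the origin (never mounted, rollback anchor only)
  lvcreate -s -n mysql-1389783211 data/mysql

  # Second-level snapshot of the first-level one, exported via iSCSI
  lvcreate -s -n mysql-1389783211.t7 data/mysql-1389783211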
Created attachment 122541 [details]
Backtrace

Just in case it helps, I'm attaching a backtrace obtained by using WARN_ONCE instead of DMERR_LIMIT in dec_ref_count() (drivers/md/persistent-data/dm-space-map-common.c:500).
This patch introduced the message: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5b564d80f8bc21094c0cd2b19b679d983aabcc29
The problem is not the message itself; it's rather that dm_thin_insert_block() fails and the whole thin pool enters read-only mode without an obvious cause (e.g. hardware errors). This has also happened to us on 3.11, which didn't include that commit.

Regards,
Apollon
(In reply to Apollon Oikonomopoulos from comment #1)
> Created attachment 122541 [details]
> Backtrace
>
> Just in case it helps, I'm attaching a backtrace obtained by using WARN_ONCE
> instead of DMERR_LIMIT in dec_ref_count()
> (drivers/md/persistent-data/dm-space-map-common.c:500).

This trace is merely the result of an earlier metadata problem, so it only offers tangential side-effect smoke, not the smoking gun (e.g. why the ref-count was already dropped to 0).

You mentioned that discards are enabled. We aren't aware of any known refcount issues associated with discards.

Unfortunately, this bug isn't giving us much to go on in terms of a reproducer, but it is certainly appreciated that you filed it. Knowing that running thinp over a period of time results in "bad stuff happens" for your use-case is concerning, but it doesn't help us isolate the problem.

You mentioned that you have seen issues before with 3.11. How did you recover? Did you run thin_check, etc.?
Did you happen to resize the metadata device at any point in the past?
Thanks for taking the time to investigate this!

(In reply to Mike Snitzer from comment #4)
> This trace is merely the result of an earlier metadata problem, so it only
> offers tangential side-effect smoke, not the smoking gun (e.g. why the
> ref-count was already dropped to 0).

I thought so, but still we were trying to find our way through the code.

> You mentioned that discards are enabled. We aren't aware of any known
> refcount issues associated with discards.

Good to know. The reason I mentioned discards is that, to my understanding, discard is one path that will decrement the refcount (the other being CoW, right?).

> Unfortunately, this bug isn't giving us much to go on in terms of a
> reproducer, but it is certainly appreciated that you filed it. Knowing
> that running thinp over a period of time results in "bad stuff happens" for
> your use-case is concerning, but it doesn't help us isolate the problem.

Unfortunately we can't reproduce it at will either. We tried performing lots of I/O on both the 2nd-level snapshot and the origin, and it didn't crash over a period of hours. I suspect it might be triggered by MySQL's workload.

> You mentioned that you have seen issues before with 3.11. How did you
> recover? Did you run thin_check, etc.?

So far it has crashed 3 times in 5 days. The LVs' lifetime was 4 to 12 hours when the crashes occurred. The first crash was with 3.11.10, the other two with 3.12.6, all of them due to dm_thin_insert_block() failing (3.11.10 did not include the reason or error number; on 3.12.6 it was the refcount both times). On 3.11.10, the errors were:

Jan 14 06:36:08 zarniwoop kernel: [416929.270970] device-mapper: thin: dm_thin_insert_block() failed
Jan 14 06:36:08 zarniwoop kernel: [416929.276935] EXT4-fs warning (device dm-6): ext4_end_bio:332: I/O error writing to inode 21 (offset 0 size 0 starting block 13939440)
Jan 14 06:36:08 zarniwoop kernel: [416929.276937] Buffer I/O error on device dm-6, logical block 13939440
Jan 14 06:36:23 zarniwoop kernel: [416943.321601] device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 9483
Jan 14 06:36:23 zarniwoop kernel: [416943.323099] device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 9483
Jan 14 06:36:23 zarniwoop kernel: [416943.323121] device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 9483
Jan 14 06:36:23 zarniwoop kernel: [416943.401632] device-mapper: thin: process_bio: dm_thin_find_block() failed: error = -22
Jan 14 06:36:23 zarniwoop kernel: [416943.409645] Buffer I/O error on device dm-6, logical block 1041
Jan 14 06:36:23 zarniwoop kernel: [416943.415668] lost page write due to I/O error on dm-6
Jan 14 06:36:23 zarniwoop kernel: [416943.415675] device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 9483
Jan 14 06:36:23 zarniwoop kernel: [416943.425707] Buffer I/O error on device dm-6, logical block 9249
Jan 14 06:36:23 zarniwoop kernel: [416943.431728] lost page write due to I/O error on dm-6
Jan 14 06:36:23 zarniwoop kernel: [416943.432609] Buffer I/O error on device dm-6, logical block 14680096
Jan 14 06:36:23 zarniwoop kernel: [416943.438989] lost page write due to I/O error on dm-6
Jan 14 06:36:25 zarniwoop kernel: [416946.154026] Aborting journal on device dm-6-8.
Jan 14 06:36:25 zarniwoop kernel: [416946.158587] Buffer I/O error on device dm-6, logical block 8945664
Jan 14 06:36:25 zarniwoop kernel: [416946.164870] lost page write due to I/O error on dm-6
Jan 14 06:36:25 zarniwoop kernel: [416946.164873] JBD2: Error -5 detected when updating journal superblock for dm-6-8.
Jan 14 06:37:08 zarniwoop kernel: [416989.282691] dm_bm_validate_buffer: 732 callbacks suppressed
Jan 14 06:37:08 zarniwoop kernel: [416989.282695] device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 9483
Jan 14 06:37:08 zarniwoop kernel: [416989.292774] device-mapper: thin: dm_thin_get_highest_mapped_block returned -22
Jan 14 06:37:08 zarniwoop kernel: [416989.300170] device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 10754
Jan 14 06:37:08 zarniwoop kernel: [416989.310330] device-mapper: thin: dm_thin_get_highest_mapped_block returned -22
Jan 14 06:37:08 zarniwoop kernel: [416989.324651] device-mapper: btree spine: node_check failed: csum 1397725534 != wanted 1397615814

Every time we had to wipe the whole thin pool and create it from scratch. We tried thin_check on the second crash, but it produced no result.

The interesting part is that on another machine (older, with slower CPUs and disks) with 3.11.8 and the exact same setup, it ran without problems for 2 weeks. On this machine, however, it crashed from the very beginning. There is no indication of hardware errors (IPMI event log, RAID controller event log and error counters, dmesg; they are all clean) and this machine has been known-good and used in production for quite some time (although not with thinp).

Any ideas?

Regards,
Apollon
(In reply to Mike Snitzer from comment #5)
> Did you happen to resize the metadata device at any point in the past?

No, we never resized it.
And it crashed again (the same way), this time without discards enabled.
(In reply to Apollon Oikonomopoulos from comment #8)
> And it crashed again (the same way), this time without discards enabled.

So which kernel are you running now, 3.12.6? You've referenced various kernels in this BZ.

You say it crashed in the same way. Can you please provide the log?

What was the system doing when it crashed? Was MySQL under load? Each time a crash happens, what steps do you take to recover/repair?
(In reply to Apollon Oikonomopoulos from comment #6)
> Every time we had to wipe the whole thin pool and create it from scratch.
> We tried thin_check on the second crash, but it produced no result.

Ah, I missed this. ^

> The interesting part is that on another machine (older, with slower CPUs and
> disks) with 3.11.8 and the exact same setup, it ran without problems for 2
> weeks. On this machine, however, it crashed from the very beginning.

There isn't a kernel crash; thinp becomes unusable. There is a difference.

BTW, the latest thinp changes for 3.14 have improved metadata operation failure handling (merged by Linus last night). They trap/log metadata operation errors early and much more consistently (and transition the pool to read-only accordingly).

I'll continue to try to come up with suggestions/ideas on your issue.
(In reply to Mike Snitzer from comment #9)
> So which kernel are you running now, 3.12.6? You've referenced various
> kernels in this BZ. You say it crashed in the same way. Can you please
> provide the log?

Yes, that's still with 3.12.6 (apologies for not clarifying this earlier). Log follows:

Jan 22 17:45:29 zarniwoop kernel: [84611.114932] device-mapper: space map common: unable to decrement a reference count below 0
Jan 22 17:45:29 zarniwoop kernel: [84611.123330] device-mapper: thin: 253:8: dm_thin_insert_block() failed: error = -22
Jan 22 17:45:29 zarniwoop kernel: [84611.131070] device-mapper: thin: 253:8: switching pool to read-only mode
Jan 22 17:45:29 zarniwoop kernel: [84611.143470] EXT4-fs warning (device dm-10): ext4_end_bio:316: I/O error writing to inode 21 (offset 0 size 0 starting block 14461968)
Jan 22 17:45:29 zarniwoop kernel: [84611.143478] Buffer I/O error on device dm-10, logical block 14461968
Jan 22 17:45:34 zarniwoop kernel: [84616.268022] Aborting journal on device dm-10-8.
Jan 22 17:45:36 zarniwoop kernel: [84618.028351] EXT4-fs error (device dm-10): ext4_journal_check_start:56: Detected aborted journal
Jan 22 17:45:36 zarniwoop kernel: [84618.037129] EXT4-fs (dm-10): Remounting filesystem read-only

> What was the system doing when it crashed? Was MySQL under load? Each time
> a crash happens, what steps do you take to recover/repair?

There were two MySQL instances running at that point. One runs locally on the machine and is a slave of our production database, which means that it has some load. At the time of the failure it had (5-minute average):

- ~100 transactions/sec
- ~100 fsync()/sec
- ~1.5 MB/s disk writes

This was not the peak load of the day (which is approx. 2.5 - 3 times that).

The other MySQL instance was running on a remote system and writing to one of the thin snapshots over iSCSI. This instance was completely idle.

We tried thin_check without success. This time we tried offlining the VG and re-onlining it again (roughly the sequence sketched at the end of this comment). The pool seemed to restore to read-write mode and we re-mounted the filesystems, only to get:

Jan 23 10:45:21 zarniwoop kernel: [145816.546635] device-mapper: space map common: unable to decrement a reference count below 0

again about an hour later. The filesystem seemed corrupt or out-of-date as well (MySQL was complaining about inconsistencies).

Eventually we deleted the thin pool and recreated everything from scratch (pool, LVs, filesystem and database dump), which is what we have done after every failure so far.

P.S.: I know the kernel didn't crash in the sense of panic(); I was using the word "crash" as a general term for "unrecoverable failure" as far as we are concerned. Apologies for the ambiguity.
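For completeness, the offline/online cycle mentioned above amounts to something like the following, assuming the filesystems are unmounted first and the VG is named "data" as in the original lvs output (illustrative only, not necessarily the exact commands used):

  # Deactivate every LV in the VG, including the thin pool
  vgchange -an data

  # Reactivate; after this the pool came back in read-write mode
  vgchange -ay data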
(In reply to Apollon Oikonomopoulos from comment #11)
> (In reply to Mike Snitzer from comment #9)
> > (In reply to Apollon Oikonomopoulos from comment #8)
> > > And it crashed again (the same way), this time without discards enabled.
> >
> > So which kernel are you running now, 3.12.6? You've referenced various kernels in this BZ. You say it crashed in the same way. Can you please provide the log?
>
> Yes, that's still with 3.12.6 (apologies for not clarifying this earlier). Log follows:
>
> Jan 22 17:45:29 zarniwoop kernel: [84611.114932] device-mapper: space map common: unable to decrement a reference count below 0
> Jan 22 17:45:29 zarniwoop kernel: [84611.123330] device-mapper: thin: 253:8: dm_thin_insert_block() failed: error = -22
> Jan 22 17:45:29 zarniwoop kernel: [84611.131070] device-mapper: thin: 253:8: switching pool to read-only mode
> Jan 22 17:45:29 zarniwoop kernel: [84611.143470] EXT4-fs warning (device dm-10): ext4_end_bio:316: I/O error writing to inode 21 (offset 0 size 0 starting block 14461968)
> Jan 22 17:45:29 zarniwoop kernel: [84611.143478] Buffer I/O error on device dm-10, logical block 14461968
> Jan 22 17:45:34 zarniwoop kernel: [84616.268022] Aborting journal on device dm-10-8.
> Jan 22 17:45:36 zarniwoop kernel: [84618.028351] EXT4-fs error (device dm-10): ext4_journal_check_start:56: Detected aborted journal
> Jan 22 17:45:36 zarniwoop kernel: [84618.037129] EXT4-fs (dm-10): Remounting filesystem read-only
>
> > What was the system doing when it crashed? Was MySQL under load? Each time
> > a crash happens, what steps do you take to recover/repair?
>
> There were two MySQL instances running at that point. One runs locally on the machine and is a slave of our production database, which means that it has some load. At the time of the failure it had (5-minute average):
>
> - ~100 transactions/sec
> - ~100 fsync()/sec
> - ~1.5 MB/s disk writes
>
> This was not the peak load of the day (which is approx. 2.5 - 3 times that).
>
> The other MySQL instance was running on a remote system and writing to one of the thin snapshots over iSCSI. This instance was completely idle.

OK, thankfully you have a stable production database.

> We tried thin_check without success. This time we tried offlining the VG and re-onlining it again. The pool seemed to restore to read-write mode and we re-mounted the filesystems, only to get:
>
> Jan 23 10:45:21 zarniwoop kernel: [145816.546635] device-mapper: space map common: unable to decrement a reference count below 0
>
> again about an hour later. The filesystem seemed corrupt or out-of-date as well (MySQL was complaining about inconsistencies).
>
> Eventually we deleted the thin pool and recreated everything from scratch (pool, LVs, filesystem and database dump), which is what we have done after every failure so far.

How open are you to trying the latest 3.13-stable kernel with the latest 3.14 DM changes included on this slave system? I can prep a fork of 3.13-stable that overlays the 3.14 DM changes. This would purely be to make certain you're running a kernel with all the latest DM changes. The closer we can be to a codebase I'm familiar with, the better off we'll be on getting a fix for you.

> P.S.: I know the kernel didn't crash in the sense of panic(); I was using the word "crash" as a general term for "unrecoverable failure" as far as we are concerned. Apologies for the ambiguity.

I appreciate your willingness to stick with thinp. I know there are limits, for anyone, to tolerate such repeat failures. So I'm treating this issue with urgent care now -- provided there is synergy with code that I'm already needing to support.
(In reply to Mike Snitzer from comment #12)
> How open are you to trying the latest 3.13-stable kernel with the latest
> 3.14 DM changes included on this slave system? I can prep a fork of
> 3.13-stable that overlays the 3.14 DM changes. This would purely be to make
> certain you're running a kernel with all the latest DM changes. The closer
> we can be to a codebase I'm familiar with, the better off we'll be on getting
> a fix for you.

I was looking at cherry-picking the changes myself, but I think I'd appreciate some help there. This is a service still in development, so we can experiment a bit without problems. However, before we take this route, there is one last test I'd like to make.

The problem with this bug report is that I feel I can't provide any useful information at all. Presumably there is some silent corruption taking place somewhere that I can't put my finger on. What also bothers me most is (like I said in a previous mail) that we had the exact same setup running on another machine before without problems, with kernel 3.11.10 (which also failed on the new machine before we tried 3.12.6).

I've been banging my head against the wall trying to find out what could be different between these machines, and I remembered a small detail: the good machine had its thin pool created while running 3.2-stable. We subsequently upgraded to 3.11 (keeping the same pool) to take advantage of discards, and everything worked fine there for more than 3 weeks. On the new machine that has been failing, we created the thin pool directly with 3.11 (and later 3.12). Since this is the only difference I can think of, do you think there is a theoretical possibility of a bug in the metadata initialization code introduced somewhere after 3.2?

To verify this, we have now re-created the thin pool using 3.2 and will reboot to 3.12 after we finish loading our data, to see how it goes. If it fails again, we'd be happy to try 3.13+master. If it works fine, I guess we'll have to bisect.

> I appreciate your willingness to stick with thinp. I know there are limits,
> for anyone, to tolerate such repeat failures. So I'm treating this issue
> with urgent care now -- provided there is synergy with code that I'm already
> needing to support.

Thanks again for looking into this; we'll be happy to help get this fixed.
A brief status update after a couple of experiments:

- We wiped the whole thin pool and re-created it using kernel 3.2.51. The setup worked fine with 3.2.51 for 4 days, at which point:
- We rebooted the machine into 3.12.6. It ran for about 5-6 hours and then complained about the usual refcount-below-zero situation, placing the pool into read-only mode.
- We rebooted again into 3.2.51, which was happy to open and use the thin pool in read-write mode (that is, we did not recreate the thin pool). It's still running after 4 days without an issue.

On every reboot into a new kernel I forced an fsck on the thin volumes, which showed no sign of corruption (example commands at the end of this comment).

Given the above, how do you suggest we proceed?

Regards,
Apollon
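A forced check of this kind typically looks like the following, run against each thin LV before mounting; the device paths are examples, not necessarily the exact commands used:

  # -f forces a full check even when the filesystem is marked clean
  e2fsck -f /dev/data/mysql
  e2fsck -f /dev/data/mysql-1389783211.t7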
Hi, please try this commit (relevant for stable 3.12+ despite the commit saying 3.13+; I'll email stable to make sure they pick it up for 3.12 once it goes upstream):

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?h=for-linus&id=cebc2de44d3bce53e46476e774126c298ca2c8a9
(In reply to Mike Snitzer from comment #15)
> Hi, please try this commit (relevant for stable 3.12+ despite the commit
> saying 3.13+; I'll email stable to make sure they pick it up for 3.12 once
> it goes upstream):
>
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-linus&id=cebc2de44d3bce53e46476e774126c298ca2c8a9

Hi,

I applied the commit on top of 3.12.9 and replaced the original module (roughly the sequence sketched below). So far it's been working, but I guess we'll have to wait a couple of days to declare victory.

Thanks for the response!
Apollon
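A minimal sketch of what such a test typically involves, assuming a 3.12.9 source tree configured to match the running kernel; the patch filename is a placeholder, and these are not necessarily the exact steps used:

  cd linux-3.12.9
  # Apply the commit above, exported as a patch, to the source tree
  patch -p1 < dm-space-map-metadata-fix.patch
  # Rebuild the modules, install them over the running kernel's module tree, reboot
  make modules
  make modules_install
  depmod -a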
Hi Mike,

It has been running stably for a week now, so I think you can mark this as fixed.

Thanks again for the response,
Apollon