Bug 68801

Summary: thin-pool: space map common: unable to decrement a reference count below 0
Product: IO/Storage
Component: LVM2/DM
Assignee: Alasdair G Kergon (agk)
Reporter: Apollon Oikonomopoulos (apoikos)
Status: CLOSED CODE_FIX
Severity: normal
Priority: P1
CC: agk, snitzer, yatiohi
Hardware: x86-64
OS: Linux
Kernel Version: 3.12.6
Regression: No
Attachments: Backtrace

Description Apollon Oikonomopoulos 2014-01-16 12:30:58 UTC
Hi,

We are using a thin LV as a backing store for a MySQL database, in order to
be able to provide writable, point-in-time clones of the database. Our layout
is as follows:

  LV                  VG    Attr      LSize   Pool     Origin           Data%  Move Log Copy%  Convert
  mysql               data  Vwi-aotz-  70.00g thinpool                   74.44                        
  mysql-1389783211    data  Vwi-a-tz-  70.00g thinpool mysql             74.38                        
  mysql-1389783211.t7 data  Vwi-aotz-  70.00g thinpool mysql-1389783211  74.38                        
  thinpool            data  twi-a-tz- 100.00g                            57.16                      

 - mysql is the source LV which is constantly written to by a MySQL instance
   running on the machine.
 - mysql-<timestamp> are snapshots of "mysql", which are not exported/mounted
   anywhere and are used purely for rollback purposes of the 2nd-level
   snapshots below.
 - mysql-<timestamp>.<name> are snapshots of mysql-<timestamp>, exported to
   other machines via iSCSI.
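
For reference, a layout like this is built with lvcreate roughly as follows
(a minimal sketch; the exact flags vary between lvm2 versions and the
commands we actually ran may have differed slightly):

  lvcreate --type thin-pool -L 100G -n thinpool data          # the pool
  lvcreate -V 70G --thinpool data/thinpool -n mysql           # origin LV
  lvcreate -s -n mysql-1389783211 data/mysql                  # 1st-level snapshot
  lvcreate -s -n mysql-1389783211.t7 data/mysql-1389783211    # 2nd-level snapshot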

We run the v3.12.6 stable kernel which, AFAICT from git, has the same
device-mapper code as master regarding thin provisioning and space maps. We
were hit by the following error twice within 3 days:

Jan 15 20:15:44 zarniwoop kernel: [101754.726823] device-mapper: space map common: unable to decrement a reference count below 0
Jan 15 20:15:44 zarniwoop kernel: [101754.735191] device-mapper: thin: 253:6: dm_thin_insert_block() failed: error = -22
Jan 15 20:15:44 zarniwoop kernel: [101754.742845] device-mapper: thin: 253:6: switching pool to read-only mode

As expected, all subsequent write operations on the pool fail. One thing we
would like to stress is that we use ext4 with discards on these LVs, which
might have something to do with the refcount issue:

/dev/mapper/data-mysql on /var/lib/mysql type ext4 (rw,relatime,discard,nobarrier,stripe=16,data=ordered)

Both syslog and the hardware report no other errors or suspicious messages at
or around the error time, and the system has only mild I/O activity.

Please let us know if you need more information or you want us to test patches.

Regards,
Apollon
Comment 1 Apollon Oikonomopoulos 2014-01-19 01:38:05 UTC
Created attachment 122541 [details]
Backtrace

Just in case it helps, I'm attaching a backtrace obtained using WARN_ONCE instead of DMERR_LIMIT in dec_ref_count() (drivers/md/persistent-data/dm-space-map-common.c:500).
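
For context, the guard that logs this message is the dec_ref_count() helper
mentioned above; simplified, it looks roughly like this (a paraphrased
sketch, not the verbatim kernel source):

  /* drivers/md/persistent-data/dm-space-map-common.c (paraphrased) */
  static int dec_ref_count(uint32_t ref_count, uint32_t *new_count)
  {
          if (!ref_count) {
                  DMERR_LIMIT("unable to decrement a reference count below 0");
                  return -EINVAL;   /* i.e. -22, matching "error = -22" above */
          }
          *new_count = ref_count - 1;
          return 0;
  }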
Comment 3 Apollon Oikonomopoulos 2014-01-21 10:33:29 UTC
The problem is not the message itself; it's rather that dm_thin_insert_block() fails and the whole thin pool enters read-only mode without an obvious cause (e.g. hardware errors). This has also happened to us on 3.11, which didn't include the said commit.

Regards,
Apollon
Comment 4 Mike Snitzer 2014-01-21 17:38:44 UTC
(In reply to Apollon Oikonomopoulos from comment #1)
> Created attachment 122541 [details]
> Bactktrace
> 
> Just in case it helps, I'm attaching a backtrace obtained using WARN_ONCE
> instead of DMERR_LIMIT in dec_ref_count()
> (drivers/md/persistent-data/dm-space-map-common.c:500).

This trace is merely the result of an earlier metadata problem.  So it only offers tangential side-effect smoke, not the smoking gun (e.g. why the ref-count was dropped to 0 already).

You mentioned that discards are enabled.  We aren't aware of any known refcount issues associated with discards.

Unfortunately, this bug isn't giving us much to go on in terms of a reproducer.  But it is certainly appreciated that you filed it.  Knowing that running thinp over a period of time results in "bad stuff happens" for your use-case is concerning, but it doesn't help us isolate the problem.

You mentioned that you have seen issues before with 3.11.  How did you recover?  Did you run thin_check, etc.?
Comment 5 Mike Snitzer 2014-01-21 17:41:21 UTC
Did you happen to resize the metadata device at any point in the past?
Comment 6 Apollon Oikonomopoulos 2014-01-22 09:21:13 UTC
Thanks for taking the time to investigate this!

(In reply to Mike Snitzer from comment #4)
> This trace is merely the result of an earlier metadata problem.  So it only
> offers tangential side-effect smoke, not the smoking gun (e.g. why the
> ref-count was dropped to 0 already).

I thought so, but still we were trying to find our way through the code.

> You mentioned that discards are enabled.  We aren't aware of any known
> refcount issues associated with discards.

Good to know. The reason I mentioned discards is that it's one path that, to
my understanding, decrements the refcount (the other being CoW, right?).

> Unfortunately, this bug isn't giving us much to go on in terms of a
> reproducer.  But it is certainly appreciated that you filled it.  Knowing
> that running thinp over a period of time results in "bad stuff happens" for
> your use-case is concerning, but it doesn't help us isolate the problem.

Unfortunately we can't reproduce it at will either. We tried performing lots
of I/O on both the 2nd-level snapshot and the origin, and it didn't crash over
a period of hours. I suspect it might be triggered by MySQL's workload.

> You mentioned that you have seen issues before with 3.11.  How did you
> recover?  did you run thin_check, etc?

So far it has crashed 3 times in 5 days. The LVs' lifetime was 4 to 12 hours
when the crashes occurred.

The first crash was with 3.11.10, the other two with 3.12.6, all of them due
to dm_thin_insert_block() failing (3.11.10 did not log the reason or error
number; on 3.12.6 it was the refcount error both times). On 3.11.10, the
errors were:

Jan 14 06:36:08 zarniwoop kernel: [416929.270970] device-mapper: thin: dm_thin_insert_block() failed
Jan 14 06:36:08 zarniwoop kernel: [416929.276935] EXT4-fs warning (device dm-6): ext4_end_bio:332: I/O error writing to inode 21 (offset 0 size 0 starting block 13939440)
Jan 14 06:36:08 zarniwoop kernel: [416929.276937] Buffer I/O error on device dm-6, logical block 13939440
Jan 14 06:36:23 zarniwoop kernel: [416943.321601] device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 9483
Jan 14 06:36:23 zarniwoop kernel: [416943.323099] device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 9483
Jan 14 06:36:23 zarniwoop kernel: [416943.323121] device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 9483
Jan 14 06:36:23 zarniwoop kernel: [416943.401632] device-mapper: thin: process_bio: dm_thin_find_block() failed: error = -22
Jan 14 06:36:23 zarniwoop kernel: [416943.409645] Buffer I/O error on device dm-6, logical block 1041
Jan 14 06:36:23 zarniwoop kernel: [416943.415668] lost page write due to I/O error on dm-6
Jan 14 06:36:23 zarniwoop kernel: [416943.415675] device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 9483
Jan 14 06:36:23 zarniwoop kernel: [416943.425707] Buffer I/O error on device dm-6, logical block 9249
Jan 14 06:36:23 zarniwoop kernel: [416943.431728] lost page write due to I/O error on dm-6
Jan 14 06:36:23 zarniwoop kernel: [416943.432609] Buffer I/O error on device dm-6, logical block 14680096
Jan 14 06:36:23 zarniwoop kernel: [416943.438989] lost page write due to I/O error on dm-6
Jan 14 06:36:25 zarniwoop kernel: [416946.154026] Aborting journal on device dm-6-8.
Jan 14 06:36:25 zarniwoop kernel: [416946.158587] Buffer I/O error on device dm-6, logical block 8945664
Jan 14 06:36:25 zarniwoop kernel: [416946.164870] lost page write due to I/O error on dm-6
Jan 14 06:36:25 zarniwoop kernel: [416946.164873] JBD2: Error -5 detected when updating journal superblock for dm-6-8.
Jan 14 06:37:08 zarniwoop kernel: [416989.282691] dm_bm_validate_buffer: 732 callbacks suppressed
Jan 14 06:37:08 zarniwoop kernel: [416989.282695] device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 9483
Jan 14 06:37:08 zarniwoop kernel: [416989.292774] device-mapper: thin: dm_thin_get_highest_mapped_block returned -22
Jan 14 06:37:08 zarniwoop kernel: [416989.300170] device-mapper: block manager: validator mismatch (old=sm_bitmap vs new=btree_node) for block 10754
Jan 14 06:37:08 zarniwoop kernel: [416989.310330] device-mapper: thin: dm_thin_get_highest_mapped_block returned -22
Jan 14 06:37:08 zarniwoop kernel: [416989.324651] device-mapper: btree spine: node_check failed: csum 1397725534 != wanted 1397615814

Every time we had to wipe the whole thin pool and create it from scratch.
We tried thin_check after the second crash, but it turned up nothing.
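
For the record: thin_check, from thin-provisioning-tools, is pointed directly
at the pool's metadata device. A rough sketch, with an illustrative device
name, assuming the metadata sub-LV can be accessed while the pool is
inactive:

  thin_check /dev/mapper/data-thinpool_tmeta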

The interesting part is that on another machine (older, with slower CPUs and
disks) with 3.11.8 and the exact same setup, everything ran without problems
for 2 weeks. On this machine, however, it crashed from the very beginning.

There is no indication of hardware errors (the IPMI event log, the RAID
controller event log and error counters, and dmesg are all clean), and this
machine has been known-good and used in production for quite some time
(although not with thinp).

Any ideas?

Regards,
Apollon
Comment 7 Apollon Oikonomopoulos 2014-01-22 09:21:29 UTC
(In reply to Mike Snitzer from comment #5)
> Did you happen to resize the metadata device at any point in the past?

No, we never resized it.
Comment 8 Apollon Oikonomopoulos 2014-01-23 07:38:43 UTC
And it crashed again (the same way), this time without discards enabled.
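
Disabling discards amounts to remounting without the discard option, e.g.
(a sketch; this may equally be done via fstab and a remount or reboot):

  mount -o remount,nodiscard /var/lib/mysql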
Comment 9 Mike Snitzer 2014-01-23 14:40:15 UTC
(In reply to Apollon Oikonomopoulos from comment #8)
> And it crashed again (the same way), this time without discards enabled.

So which kernel are you running now, 3.12.6?  You've referenced various kernels in this BZ.  You say it crashed in the same way; can you please provide the log?

What was the system doing when it crashed?  Was MySQL under load?  Each time a crash happens what steps do you take to recover/repair?
Comment 10 Mike Snitzer 2014-01-23 14:55:52 UTC
(In reply to Apollon Oikonomopoulos from comment #6)

> Every time we had to wipe the whole thin pool and create it from scratch.
> We tried thin_check on the second crash, but it produced no result.

Ah I missed this. ^

> The interesting part is that on another machine (older, with slower CPUs and
> disks) with 3.11.8 and the exact same setup, it ran without problem for 2
> weeks. On this machine however, it crashed from the very beginning.

There isn't a kernel crash.  thinp becomes unusable.  There is a difference.

BTW, the latest thinp changes for 3.14 (merged by Linus last night) have improved metadata operation failure handling: metadata operation errors are now trapped/logged much more consistently and early, and the pool transitions to read-only accordingly.

I'll keep trying to come up with suggestions/ideas on your issue.
Comment 11 Apollon Oikonomopoulos 2014-01-23 15:04:06 UTC
(In reply to Mike Snitzer from comment #9)
> (In reply to Apollon Oikonomopoulos from comment #8)
> > And it crashed again (the same way), this time without discards enabled.
>
> So which kernel are you running now 3.12.6?  You've referenced various
> kernels in this BZ.  You say it crashed in the same way.  Can you please
> provide the log?

Yes, that's still with 3.12.6 (apologies for not clarifying this earlier). Log follows:

Jan 22 17:45:29 zarniwoop kernel: [84611.114932] device-mapper: space map common: unable to decrement a reference count below 0
Jan 22 17:45:29 zarniwoop kernel: [84611.123330] device-mapper: thin: 253:8: dm_thin_insert_block() failed: error = -22
Jan 22 17:45:29 zarniwoop kernel: [84611.131070] device-mapper: thin: 253:8: switching pool to read-only mode
Jan 22 17:45:29 zarniwoop kernel: [84611.143470] EXT4-fs warning (device dm-10): ext4_end_bio:316: I/O error writing to inode 21 (offset 0 size 0 starting block 14461968)
Jan 22 17:45:29 zarniwoop kernel: [84611.143478] Buffer I/O error on device dm-10, logical block 14461968
Jan 22 17:45:34 zarniwoop kernel: [84616.268022] Aborting journal on device dm-10-8.
Jan 22 17:45:36 zarniwoop kernel: [84618.028351] EXT4-fs error (device dm-10): ext4_journal_check_start:56: Detected aborted journal
Jan 22 17:45:36 zarniwoop kernel: [84618.037129] EXT4-fs (dm-10): Remounting filesystem read-only
 
> What was the system doing when it crashed?  Was MySQL under load?  Each time
> a crash happens what steps do you take to recover/repair?

There were two MySQL instances running at that point. One runs locally on
the machine and is a slave to our production database, which means it sees
some load. At the time of the failure it had (5-minute averages):

 - ~100 transactions/sec
 - ~100 fsync()/sec
 - ~1.5 MB/s disk writes

This was not the peak load of the day (which is approx. 2.5 - 3 times that).

The other MySQL instance was running on a remote system and writing to one of
the thin snapshots over iSCSI. This instance was completely idle.

We tried thin_check without success. This time we also tried offlining the
VG and onlining it again. The pool seemed to return to read-write mode and we
re-mounted the filesystems, only to get:

Jan 23 10:45:21 zarniwoop kernel: [145816.546635] device-mapper: space map common: unable to decrement a reference count below 0

again about an hour later. The filesystem seemed corrupt or out-of-date as well
(MySQL was complaining about inconsistencies).
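
The offline/online cycle above was essentially the following (a
reconstruction; the exact commands may have differed):

  vgchange -an data    # deactivate the VG, pool included
  vgchange -ay data    # reactivate; the pool came back read-write
  mount /dev/mapper/data-mysql /var/lib/mysql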

Eventually we deleted the thin pool and recreated everything from scratch
(pool, LVs, filesystem and database dump), which is what we have done after
every failure so far.

P.S.: I know the kernel didn't crash in the sense of panic(); I was using the
      word "crash" as a general term for "unrecoverable failure" as far as we
      are concerned. Apologies for the ambiguity.
Comment 12 Mike Snitzer 2014-01-23 15:26:08 UTC
(In reply to Apollon Oikonomopoulos from comment #11)
> (In reply to Mike Snitzer from comment #9)
> > (In reply to Apollon Oikonomopoulos from comment #8)
> > > And it crashed again (the same way), this time without discards enabled.
> >
> > So which kernel are you running now 3.12.6?  You've referenced various
> > kernels in this BZ.  You say it crashed in the same way.  Can you please
> > provide the log?
> 
> Yes, that's still with 3.12.6 (apologies for not clarifying this earlier).
> Log follows:
> 
> Jan 22 17:45:29 zarniwoop kernel: [84611.114932] device-mapper: space map
> common: unable to decrement a reference count below 0
> Jan 22 17:45:29 zarniwoop kernel: [84611.123330] device-mapper: thin: 253:8:
> dm_thin_insert_block() failed: error = -22
> Jan 22 17:45:29 zarniwoop kernel: [84611.131070] device-mapper: thin: 253:8:
> switching pool to read-only mode
> Jan 22 17:45:29 zarniwoop kernel: [84611.143470] EXT4-fs warning (device
> dm-10): ext4_end_bio:316: I/O error writing to inode 21 (offset 0 size 0
> starting block 14461968)
> Jan 22 17:45:29 zarniwoop kernel: [84611.143478] Buffer I/O error on device
> dm-10, logical block 14461968
> Jan 22 17:45:34 zarniwoop kernel: [84616.268022] Aborting journal on device
> dm-10-8.
> Jan 22 17:45:36 zarniwoop kernel: [84618.028351] EXT4-fs error (device
> dm-10): ext4_journal_check_start:56: Detected aborted journal
> Jan 22 17:45:36 zarniwoop kernel: [84618.037129] EXT4-fs (dm-10): Remounting
> filesystem read-only
>  
> > What was the system doing when it crashed?  Was MySQL under load?  Each time
> > a crash happens what steps do you take to recover/repair?
> 
> There were two MySQL instances running at that point. One runs locally on
> the machine and is a slave to our production database, which means it sees
> some load. At the time of the failure it had (5-minute averages):
> 
>  - ~100 transactions/sec
>  - ~100 fsync()/sec
>  - ~1.5 MB/s disk writes
> 
> This was not the peak load of the day (which is approx. 2.5 - 3 times that).
> 
> The other MySQL instance was running on a remote system and writing to one of
> the thin snapshots over iSCSI. This instance was completely idle.

OK, thankfully you have a stable production database.

> We tried thin_check without success. This time we also tried offlining the
> VG and onlining it again. The pool seemed to return to read-write mode and
> we re-mounted the filesystems, only to get:
> 
> Jan 23 10:45:21 zarniwoop kernel: [145816.546635] device-mapper: space map
> common: unable to decrement a reference count below 0
> 
> again about an hour later. The filesystem seemed corrupt or out-of-date as
> well (MySQL was complaining about inconsistencies).
> 
> Eventually we deleted the thin pool and recreated everything from scratch
> (pool, LVs, filesystem and database dump), which is what we have done after
> every failure so far.

How open are you to trying the latest 3.13-stable kernel with the latest 3.14 DM changes included on this slave system?  I can prep a fork of 3.13-stable that overlays the 3.14 DM changes.  This would purely be to make certain you're running a kernel with all the latest DM changes.  The closer we can be to a codebase I'm familiar with, the better off we'll be in getting a fix for you.

> P.S.: I know the kernel didn't crash in the sense of panic(); I was using the
>       word "crash" as a general term for "unrecoverable failure" as far as we
>       are concerned. Apologies for the ambiguity.

I appreciate your willingness to stick with thinp.  I know there are limits, for anyone, to tolerating such repeat failures.  So I'm treating this issue as urgent now -- provided there is synergy with code that I'm already needing to support.
Comment 13 Apollon Oikonomopoulos 2014-01-23 16:09:12 UTC
(In reply to Mike Snitzer from comment #12) 
> How open are you to trying the latest 3.13-stable kernel with the latest
> 3.14 DM changes included on this slave system?  I can prep a fork of
> 3.13-stable that overlays the 3.14 DM changes.  This would purely be to make
> certain you're running a kernel with all the latest DM changes.  The closer
> we can be to a codebase I'm familiar with the better off we'll be on getting
> a fix for you.

I was looking at cherry-picking the changes myself, but I'd appreciate some
help there. This is a service still in development, so we can experiment a
bit without problems.

However, before we take this route, there is one last test I'd like to run.
The problem with this bug report is that I feel I can't provide any useful
information at all. Presumably some silent corruption is taking place
somewhere that I can't put my finger on.

Also, what bothers me most (as I said in a previous mail) is that we had the
exact same setup running on another machine before without problems, with
kernel 3.11.10 (which also failed on the new machine before we tried 3.12.6).
I've been banging my head against the wall trying to find out what could be
different between these machines, and I remembered a small detail:

The good machine had its thin pool created while running 3.2-stable. We
subsequently upgraded to 3.11 (keeping the same pool) to take advantage of
discards, and everything worked fine there for more than 3 weeks. On this new
machine that has been failing, we created the thin pool directly with 3.11
(and later 3.12). Since this is the only difference I can think of, do you
think there is a theoretical possibility of a bug in the metadata
initialization code introduced somewhere after 3.2?

To verify this, we have now re-created the thin pool using 3.2 and will
reboot into 3.12 after we finish loading our data, to see how it goes. If it
fails again, we'd be happy to try 3.13+master. If it works fine, I guess
we'll have to bisect.

> I appreciate your willingness to stick with thinp.  I know there are limits,
> for anyone, to tolerate such repeat failures.  So I'm treating this issue
> with urgent care now -- provided there is synergy with code that I'm already
> needing to support.

Thanks again for looking into this, we'll be happy to help get this fixed.
Comment 14 Apollon Oikonomopoulos 2014-02-04 09:35:09 UTC
A brief status update after a couple of experiments:

 - We wiped the whole thin pool and re-created it using kernel 3.2.51. The
   setup worked fine with 3.2.51 for 4 days, at which point:

 - We rebooted the machine into 3.12.6. It ran for about 5-6 hours and then
   complained about the usual refcount-below-zero situation, placing the pool
   into read-only mode.

 - We rebooted again into 3.2.51, which was happy to open and use the thin
   pool in read-write mode (that is, we did not recreate the thin pool). It's
   still running after 4 days without an issue.

On every reboot into a new kernel I forced an fsck on the thin volumes, which
showed no sign of corruption.
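
The forced check was along these lines (a sketch; -f makes e2fsck run a full
pass even when the filesystem is marked clean):

  fsck.ext4 -f /dev/mapper/data-mysql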

Given the above, how do you suggest we proceed?

Regards,
Apollon
Comment 15 Mike Snitzer 2014-03-07 20:51:42 UTC
Hi, please try this commit (relevant for stable 3.12+ despite the commit saying 3.13+; I'll email stable to make sure they pick it up for 3.12 once it goes upstream):

https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-linus&id=cebc2de44d3bce53e46476e774126c298ca2c8a9
Comment 16 Apollon Oikonomopoulos 2014-03-10 18:31:43 UTC
(In reply to Mike Snitzer from comment #15)
> Hi, please try this commit (relevant for stable 3.12+ despite the commit
> saying 3.13+.. I'll email stable to make sure they pick it up for 3.12 once
> it goes upstream):
> 
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/
> commit/?h=for-linus&id=cebc2de44d3bce53e46476e774126c298ca2c8a9

Hi,

I applied the commit on top of 3.12.9 and replaced the original module. So far it's been working, but I guess we'll have to wait a couple of days to declare victory.
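
Concretely, that was along these lines (a sketch of what it amounts to;
assumes an already-configured 3.12.9 tree):

  git checkout v3.12.9
  git cherry-pick cebc2de44d3bce53e46476e774126c298ca2c8a9
  make M=drivers/md modules    # rebuild only the md/dm modules
  # then replace the module carrying the space-map code
  # (dm-persistent-data.ko) and reload it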

Thanks for the response!
Apollon
Comment 17 Apollon Oikonomopoulos 2014-03-18 13:51:41 UTC
Hi Mike,

It has been running stably for a week now, so I think you can mark this as fixed.

Thanks again for the response,
Apollon