Bug 102731 - I have a cough.
Summary: I have a cough.
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: ext4 (show other bugs)
Hardware: x86-64 Linux
: P1 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-08-12 08:47 UTC by John Hughes
Modified: 2016-04-19 14:25 UTC (History)
7 users (show)

See Also:
Kernel Version: 3.16
Tree: Mainline
Regression: Yes


Attachments
config for kernel 3.18.19 (156.70 KB, application/octet-stream)
2015-08-12 19:25 UTC, John Hughes
Details
fsck output after reboot. (10.42 KB, text/plain)
2015-08-31 15:47 UTC, John Hughes
Details
attachment-20424-0.html (696 bytes, text/html)
2016-01-22 16:36 UTC, John Hughes
Details

Description John Hughes 2015-08-12 08:47:05 UTC
This is a bug opened to talk about my symptoms that lead up to opening bug 

https://bugzilla.kernel.org/show_bug.cgi?id=89621

as requested by Theodore Tso
Comment 1 John Hughes 2015-08-12 08:56:30 UTC
The hardware:

A couple of Dell PowerEdge 840's (Xeon 3050, 4G).

disks SATA internal and SCSI in external cabinets

Host OS Debian 7.7 kernel 3.16, version  3.16.5-1~bpo70+1

Disks assembled into RAID1 arrays then passed to kvm guest as virtio.

Guest OS Debian 7.7, kernel either as host, or now 3.18.19

Guest uses LVM2

All filesystems on guest are ext3, now handled by ext4 subsystem.
Comment 2 John Hughes 2015-08-12 09:02:48 UTC
host uptime:

 10:59:22 up 24 days, 14:21,  2 users,  load average: 0.39, 0.51, 0.90

(Last reboot caused by power cut)

PCI config on host:

00:00.0 Host bridge: Intel Corporation E7230/3000/3010 Memory Controller Hub
00:01.0 PCI bridge: Intel Corporation E7230/3000/3010 PCI Express Root Port
00:1c.0 PCI bridge: Intel Corporation NM10/ICH7 Family PCI Express Port 1 (rev 01)
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01)
00:1d.0 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #1 (rev 01)
00:1d.1 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #2 (rev 01)
00:1d.2 USB controller: Intel Corporation NM10/ICH7 Family USB UHCI Controller #3 (rev 01)
00:1d.7 USB controller: Intel Corporation NM10/ICH7 Family USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
00:1f.3 SMBus: Intel Corporation NM10/ICH7 Family SMBus Controller (rev 01)
01:00.0 PCI bridge: Intel Corporation 6702PXH PCI Express-to-PCI Bridge A (rev 09)
01:00.1 PIC: Intel Corporation 6700/6702PXH I/OxAPIC Interrupt Controller A (rev 09)
02:08.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068 PCI-X Fusion-MPT SAS (rev 01)
03:00.0 PCI bridge: Intel Corporation 6702PXH PCI Express-to-PCI Bridge A (rev 09)
04:01.0 SCSI storage controller: Adaptec AHA-3960D / AIC-7899A U160/m (rev 01)
04:01.1 SCSI storage controller: Adaptec AHA-3960D / AIC-7899A U160/m (rev 01)
04:02.0 Ethernet controller: Intel Corporation 82543GC Gigabit Ethernet Controller (Copper) (rev 02)
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
07:00.0 Ethernet controller: Altima (nee Broadcom) AC9100 Gigabit Ethernet (rev 15)
07:05.0 VGA compatible controller: Advanced Micro Devices [AMD] nee ATI ES1000 (rev 02)


Disk configuration on host:

[0:0:0:0]    cd/dvd  LITE-ON  CD-ROM LTN-4891S NDS3  /dev/sr0 
[2:0:0:0]    disk    HITACHI  HUS151473VL3800  S110  /dev/sda 
[2:0:1:0]    disk    FUJITSU  MAP3735NC        5605  /dev/sdb 
[2:0:2:0]    disk    FUJITSU  MAP3735NC        5605  /dev/sdc 
[2:0:3:0]    disk    FUJITSU  MAP3735NC        5605  /dev/sdd 
[2:0:8:0]    disk    QUANTUM  ATLAS10K2-TY367J DA40  /dev/sde 
[2:0:10:0]   disk    QUANTUM  ATLAS10K2-TY367J DA40  /dev/sdf 
[3:0:0:0]    disk    IBM      IC35L036UCDY10-0 S27E  /dev/sdg 
[3:0:1:0]    disk    QUANTUM  ATLAS10K2-TY367J DA40  /dev/sdh 
[3:0:2:0]    disk    FUJITSU  MAP3735NC        5608  /dev/sdi 
[3:0:3:0]    disk    FUJITSU  MAP3735NC        5608  /dev/sdj 
[3:0:4:0]    disk    FUJITSU  MAP3735NC        5608  /dev/sdk 
[3:0:5:0]    disk    FUJITSU  MAP3735NC        5608  /dev/sdl 
[3:0:9:0]    disk    IBM      DMVS18M          0220  /dev/sdm 
[3:0:10:0]   disk    IBM      DMVS18D          02B0  /dev/sdn 
[3:0:11:0]   disk    IBM      IC35L036UCDY10-0 S27F  /dev/sdo 
[3:0:12:0]   disk    SEAGATE  ST318305LC       2203  /dev/sdp 
[3:0:13:0]   disk    FUJITSU  MAX3073NC        0104  /dev/sdq 
[3:0:15:0]   process Dell     12 BAY U2W CU    0209  -        
[4:0:0:0]    disk    ATA      ST3808110AS      J     /dev/sds 
[4:0:1:0]    disk    ATA      ST3808110AS      J     /dev/sdr 
[4:0:2:0]    disk    ATA      ST3808110AS      J     /dev/sdt 
[4:0:3:0]    disk    ATA      ST3808110AS      J     /dev/sdu 

mdadm config on host:

Personalities : [raid1] 
md122 : active raid1 sda1[2] sdj1[3]
      71680902 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md123 : active raid1 sdb1[0] sdk1[1]
      71616384 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md124 : active raid1 sdg1[3] sdf1[2]
      35549071 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md125 : active raid1 sdh1[2] sde1[3]
      35549071 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md126 : active raid1 sdc1[0] sdi1[2]
      71680902 blocks super 1.2 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md127 : active raid1 sdd[0] sdl[1]
      71621696 blocks super 1.2 [2/2] [UU]
      
md2 : active (auto-read-only) raid1 sdt1[2] sdu1[1]
      78121912 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 sds2[0] sdr2[1]
      77623224 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md0 : active raid1 sds1[0] sdr1[1]
      498676 blocks super 1.2 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

Last dmesg messages on host:

[1225118.952197] md: data-check of RAID array md0
[1225118.952291] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[1225118.952384] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[1225118.952528] md: using 128k window, over a total of 498676k.
[1225119.166383] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
[1225119.178563] md: data-check of RAID array md122
[1225119.178644] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[1225119.178735] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[1225119.178877] md: using 128k window, over a total of 71680902k.
[1225119.192249] md: data-check of RAID array md123
[1225119.192331] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[1225119.192422] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[1225119.192564] md: using 128k window, over a total of 71616384k.
[1225119.205378] md: data-check of RAID array md124
[1225119.205461] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[1225119.210851] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[1225119.216548] md: using 128k window, over a total of 35549071k.
[1225119.230698] md: data-check of RAID array md125
[1225119.236498] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[1225119.242171] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[1225119.247721] md: using 128k window, over a total of 35549071k.
[1225119.265408] md: data-check of RAID array md126
[1225119.270962] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[1225119.276552] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[1225119.282158] md: using 128k window, over a total of 71680902k.
[1225119.296781] md: data-check of RAID array md127
[1225119.302834] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[1225119.309046] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[1225119.315202] md: using 128k window, over a total of 71621696k.
[1225127.248170] md: md0: data-check done.
[1225127.284640] md: data-check of RAID array md1
[1225127.290534] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
[1225127.296453] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
[1225127.302521] md: using 128k window, over a total of 77623224k.
[1226450.960973] md: md1: data-check done.
[1226756.371905] md: md125: data-check done.
[1227131.051785] md: md127: data-check done.
[1227141.431898] md: md124: data-check done.
[1227808.276946] md: md126: data-check done.
[1227847.417636] md: md123: data-check done.
[1227978.744149] md: md122: data-check done.

Configuration of guest:

<domain type='kvm' id='4'>
  <name>olympic</name>
  <uuid>b42a5958-4258-1e0b-cbd0-05a0768c250b</uuid>
  <memory unit='KiB'>2097152</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <vcpu placement='static'>1</vcpu>
  <os>
    <type arch='x86_64' machine='pc-1.1'>hvm</type>
    <boot dev='hd'/>
  </os>
  <clock offset='utc' adjustment='reset'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/disk/by-id/md-name-olympic:70a'/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/disk/by-id/md-name-olympic:70b'/>
      <target dev='vdb' bus='virtio'/>
      <alias name='virtio-disk1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/disk/by-id/md-name-olympic:34c'/>
      <target dev='vdc' bus='virtio'/>
      <alias name='virtio-disk2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/disk/by-id/md-name-olympic:34d'/>
      <target dev='vdd' bus='virtio'/>
      <alias name='virtio-disk3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/disk/by-id/md-name-olympic:70e'/>
      <target dev='vde' bus='virtio'/>
      <alias name='virtio-disk4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/disk/by-id/md-name-olympic:70f'/>
      <target dev='vdf' bus='virtio'/>
      <alias name='virtio-disk5'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
    </disk>
    <controller type='usb' index='0'>
      <alias name='usb0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <interface type='bridge'>
      <mac address='00:16:3e:4b:86:33'/>
      <source bridge='calvaedi'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/0'/>
      <target port='0'/>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/0'>
      <source path='/dev/pts/0'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='none'/>
</domain>
Comment 3 John Hughes 2015-08-12 09:11:39 UTC
What is the guest doing?

It's a NFSv4 server for my home directories.

It's my email server (sendmail and cyrus imap).

I've seen the problem (bug 89621) on the root filesystem, the home filesystem and the imap filesystem.

Most often on the root, a little less frequently the homedir filesystem.
Comment 4 Theodore Tso 2015-08-12 18:53:17 UTC
Can you upload your kernel .config file so I can try to replicate your environment as closely as possible?

-- Ted
Comment 5 John Hughes 2015-08-12 19:25:35 UTC
Created attachment 184741 [details]
config for kernel 3.18.19
Comment 6 John Hughes 2015-08-31 15:46:25 UTC
Well, it's just happened again.

[1728664.535798] EXT4-fs (dm-2): pa ffff88001ce2d368: logic 512, phys. 19774464, len 512
[1728664.538530] EXT4-fs error (device dm-2): ext4_mb_release_inode_pa:3773: group 603, free 449, pa_free 448
[1728664.541378] Aborting journal on device dm-2-8.
[1728664.549493] EXT4-fs (dm-2): Remounting filesystem read-only
[1728664.551281] EXT4-fs error (device dm-2) in ext4_reserve_inode_write:4775: Journal has aborted
[1728664.590678] EXT4-fs error (device dm-2): ext4_journal_check_start:56: Detected aborted journal
[1728664.599749] EXT4-fs error (device dm-2) in ext4_orphan_del:2688: Journal has aborted
[1728664.612118] EXT4-fs error (device dm-2) in ext4_reserve_inode_write:4775: Journal has aborted

I'll attach the fsck output.
Comment 7 John Hughes 2015-08-31 15:47:58 UTC
Created attachment 186381 [details]
fsck output after reboot.
Comment 8 Theodore Tso 2015-08-31 18:03:10 UTC
How many guest VM's are you running, and how often does the file system corruption happens?   It sounds like it's about once every 2-3 weeks?

Would you be willing to experiment running the VM without using the RAID1 so we're bypassing the MD layer?   One of the common factors between those people who are reporting this are (a) they are using KVM, and (b) they are using RAID.   Now, I'm doing all of my testing using KVM, and I've never seen anything like this.   So I wonder if the operative issue is happening at the MD layer....

You workload is common enough that I would expect that if this was happening for everyone who was using ext4, I should be getting a lot more bug reports.
Comment 9 John Hughes 2015-09-01 10:28:53 UTC
> How many guest VM's are you running

One at the moment.

> how often does the file system corruption happens?   It sounds like it's
> about once every 2-3 weeks?

With the Debian 3.16 kernel it was happening about once every 4 days.

I installed my 3.18.19 kernel on 24/7/2015 and have seen the bug twice:
11/8/2015 and 31/8/2015.

> Would you be willing to experiment running the VM without using the RAID1 so
> we're bypassing the MD layer?

I guess I'll have to try that.  It's going to be a monumental pain in the arse:

The guests /home is a 160GiB LVM volume striped across 4 mdadm raid1's.

What I can try first is just disabling one side of the mdadm mirrors, if the problem re-occurs I can get rid of mdadm completely.

(This will be the first time I've run one of my servers without disk mirroring since 1986.  Ugh.)
Comment 10 cwseys 2015-09-01 14:43:07 UTC
If you're getting failures faster in kernel 3.16 you might also get failures 
faster by backtracking
3.15.x
3.14.x
etc towards 3.2

until you have to wait long enough to be confident you've passed the bug.

Also, less surgery on Linux raid.

Chad.
Comment 11 John Hughes 2015-09-01 16:08:57 UTC
On 01/09/15 16:43, bugzilla-daemon@bugzilla.kernel.org wrote:
> If you're getting failures faster in kernel 3.16 you might also get failures
> faster by backtracking
> 3.15.x
> 3.14.x
> etc towards 3.2
Well, since a similar bug is known to be present before 3.16.something 
that doesn't seem like a good idea.

But you're right in one thing, since the bug happens faster with the 
Debian 3.16 than with my 3.18.19 I might try running that one without 
the raid.
Comment 12 John Hughes 2015-09-16 14:09:15 UTC
Ok, as of a few minutes ago I am running the Debian 3.16 kernel again, with all the mdadm devices unmirrored.

More news if I see the problem again.
Comment 13 Theodore Tso 2015-09-28 17:06:41 UTC
So it's been 12 days, and previously when you were using the Debian 3.16 kernel, it was triggering once every four days, right?  Can I assume that your silence indicates that you haven't seen a problem to date?

If so, then it really does seen that it might be an interaction between LVM/MD and KVM.

So if that's the case, then the next thing to ask is to try to figure out what might be the triggering cause.   A couple of things come to mind:

1) Some failure to properly handle a flush cache command being sent to the MD device.  This combined to either a power failure or a crash of the guest OS (depending on how KVM is configured), might explain a block update getting lost.   The fact that the block bitmap is out of sync with the block group descriptor is consistent with this failure.  However, if you were seeing failures once every four days, that would imply that the guest OS and/or host OS would be crashing at that or about that level of frequency, and you haven't reported that. 

2) Some kind a race between a 4k write and a RAID1 resync leading to a block write getting lost.  Again, this reported data corruption is consistent with this theory --- but this also requires the guest OS crashing due to some kind of kernel crash or KVM/qemu shutdown and/or host OS crash / power failure, as in (1) above.  If you weren't seeing these failures once every four days or so, then this isn't a likely explanation.

3)  Some kind of corruption caused by the TRIM command being sent to the RAID/MD device, possibly racing with a block bitmap update.  This could be caused either by the file system being mounted with the -o discard mount option, or by fstrim getting run out of cron, or by e2fsck explicitly being asked to discard unused blocks (with the "-E discard" option).

4)  Some kind of bug which happens rarely either in qemu, the host kernel or the guest kernel depending on how it communicates with the virtual disk.  (i.e., virtio, scsi, ide, etc.)   Virtio is the most likely use case, and so trying to change to use scsi emulation might be interesting.  (OTOH, if the problem is specific to the MD layer, then this possibility is less likely.)

So as far as #3 is concerned, can you check to see if you had fstrim enabled, or are mounting the file system with -o discard?
Comment 14 John Hughes 2015-09-30 09:49:21 UTC
On 28/09/15 19:06, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=102731
>
> --- Comment #13 from Theodore Tso <tytso@mit.edu> ---
> So it's been 12 days, and previously when you were using the Debian 3.16
> kernel, it was triggering once every four days, right?  Can I assume that
> your
> silence indicates that you haven't seen a problem to date?

I haven't seen the problem, but unfortunately I'm running 3.18.19 at the 
moment (I screwed up on the last boot and let it boot the default 
kernel).  I haven't had time to reboot.  So I'd like to give it a bit 
more time.
>
> If so, then it really does seen that it might be an interaction between
> LVM/MD
> and KVM.
>
> So if that's the case, then the next thing to ask is to try to figure out
> what
> might be the triggering cause.   A couple of things come to mind:
>
> 1) Some failure to properly handle a flush cache command being sent to the MD
> device.  This combined to either a power failure or a crash of the guest OS
> (depending on how KVM is configured), might explain a block update getting
> lost.   The fact that the block bitmap is out of sync with the block group
> descriptor is consistent with this failure.  However, if you were seeing
> failures once every four days, that would imply that the guest OS and/or host
> OS would be crashing at that or about that level of frequency, and you
> haven't
> reported that.

I haven't had any host or guest crashes.

>
> 2) Some kind a race between a 4k write and a RAID1 resync leading to a block
> write getting lost.  Again, this reported data corruption is consistent with
> this theory --- but this also requires the guest OS crashing due to some kind
> of kernel crash or KVM/qemu shutdown and/or host OS crash / power failure, as
> in (1) above.  If you weren't seeing these failures once every four days or
> so,
> then this isn't a likely explanation.

No crashes.

>
> 3)  Some kind of corruption caused by the TRIM command being sent to the
> RAID/MD device, possibly racing with a block bitmap update.  This could be
> caused either by the file system being mounted with the -o discard mount
> option, or by fstrim getting run out of cron, or by e2fsck explicitly being
> asked to discard unused blocks (with the "-E discard" option).

I'm not using "-o discard", or fstrim, I've never used the "-E discard" 
option to fsck.
>
> 4)  Some kind of bug which happens rarely either in qemu, the host kernel or
> the guest kernel depending on how it communicates with the virtual disk.
> (i.e., virtio, scsi, ide, etc.)   Virtio is the most likely use case, and so
> trying to change to use scsi emulation might be interesting.  (OTOH, if the
> problem is specific to the MD layer, then this possibility is less likely.)
>
> So as far as #3 is concerned, can you check to see if you had fstrim enabled,
> or are mounting the file system with -o discard?
>

I'm a bit overwhelmed with work at the moment so I haven't had time to 
read this message with the care it deserves, I'll get back to you with 
more detail next week.
Comment 15 John Hughes 2015-10-07 16:17:04 UTC
On 28/09/15 19:06, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=102731
>
> --- Comment #13 from Theodore Tso <tytso@mit.edu> ---
> So it's been 12 days, and previously when you were using the Debian 3.16
> kernel, it was triggering once every four days, right?  Can I assume that
> your
> silence indicates that you haven't seen a problem to date?

As I said I was, due to a silly mistake (I rebooted the system and 
forgot that the default kernel was still the 3.18.19 one) not running 
the Debian kernel.

And, after 19 days the problem has shown up again:

Oct  7 17:34:24 olympic kernel: [1657429.788105] EXT4-fs (dm-2): pa ffff880004211a50: logic 512, phys. 10298368, len 512
Oct  7 17:34:24 olympic kernel: [1657429.790412] EXT4-fs error (device dm-2): ext4_mb_release_inode_pa:3773: group 314, free 497, pa_free 495
Oct  7 17:34:24 olympic kernel: [1657429.793168] Aborting journal on device dm-2-8.
Oct  7 17:34:24 olympic kernel: [1657429.795367] EXT4-fs (dm-2): Remounting filesystem read-only


All the filesystem where it happened, "dm-2", is a LVM volume:

# lvdisplay -m /dev/olympic/olympic-home
   --- Logical volume ---
   LV Path                /dev/olympic/olympic-home
   LV Name                olympic-home
   VG Name                olympic
   LV UUID                drA6nQ-zbcu-SDLc-UeXH-foML-fPcB-ve1HQf
   LV Write Access        read/write
   LV Creation host, time ,
   LV Status              available
   # open                 1
   LV Size                160.00 GiB
   Current LE             40960
   Segments               3
   Allocation             inherit
   Read ahead sectors     auto
   - currently set to     512
   Block device           253:2
    
   --- Segments ---
   Logical extent 0 to 34999:
     Type		striped
     Stripes		2
     Stripe size		64.00 KiB
     Stripe 0:
       Physical volume	/dev/vdf
       Physical extents	0 to 17499
     Stripe 1:
       Physical volume	/dev/vdb
       Physical extents	0 to 17499
    
   Logical extent 35000 to 40861:
     Type		striped
     Stripes		2
     Stripe size		64.00 KiB
     Stripe 0:
       Physical volume	/dev/vdd
       Physical extents	4339 to 7269
     Stripe 1:
       Physical volume	/dev/vdc
       Physical extents	4388 to 7318
    
   Logical extent 40862 to 40959:
     Type		striped
     Stripes		2
     Stripe size		64.00 KiB
     Stripe 0:
       Physical volume	/dev/vdd
       Physical extents	7270 to 7318
     Stripe 1:
       Physical volume	/dev/vdc
       Physical extents	4339 to 4387

The devices, vdf, vdb, vdd, vdc are all unmirrored:

    <disk type='block' device='disk'>
       <driver name='qemu' type='raw' cache='none'/>
       <source dev='/dev/disk/by-id/md-name-olympic:70b'/>
       <target dev='vdb' bus='virtio'/>
       <alias name='virtio-disk1'/>
       <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
     </disk>
     <disk type='block' device='disk'>
       <driver name='qemu' type='raw' cache='none'/>
       <source dev='/dev/disk/by-id/md-name-olympic:34c'/>
       <target dev='vdc' bus='virtio'/>
       <alias name='virtio-disk2'/>
       <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
     </disk>
     <disk type='block' device='disk'>
       <driver name='qemu' type='raw' cache='none'/>
       <source dev='/dev/disk/by-id/md-name-olympic:34d'/>
       <target dev='vdd' bus='virtio'/>
       <alias name='virtio-disk3'/>
       <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
     </disk>
     ...
     <disk type='block' device='disk'>
       <driver name='qemu' type='raw' cache='none'/>
       <source dev='/dev/disk/by-id/md-name-olympic:70f'/>
       <target dev='vdf' bus='virtio'/>
       <alias name='virtio-disk5'/>
       <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
     </disk>

lrwxrwxrwx 1 root root 11 Oct  7 17:48 /dev/disk/by-id/md-name-olympic:34c -> ../../md125
lrwxrwxrwx 1 root root 11 Oct  7 17:48 /dev/disk/by-id/md-name-olympic:34d -> ../../md124
lrwxrwxrwx 1 root root 11 Oct  7 17:48 /dev/disk/by-id/md-name-olympic:70b -> ../../md126
lrwxrwxrwx 1 root root 11 Oct  7 17:48 /dev/disk/by-id/md-name-olympic:70f -> ../../md122

md122 : active raid1 sda1[2] sdj1[3](F)
       71680902 blocks super 1.2 [2/1] [U_]
--
md124 : active raid1 sdg1[3] sdf1[2](F)
       35549071 blocks super 1.2 [2/1] [U_]
--
md125 : active raid1 sdh1[2] sde1[3](F)
       35549071 blocks super 1.2 [2/1] [U_]
--
md126 : active raid1 sdc1[0] sdi1[2](F)
       71680902 blocks super 1.2 [2/1] [U_]
Comment 16 John Hughes 2015-10-08 09:16:56 UTC
Minor detail:

The ext4 decided the filesystem was corrupt on 7/10, but on 4/10 I'd run an online check on it and it was OK.  (took a LVM snapshot, fsck'd that).

Oct  4 02:44:27 olympic online-check: .IN: Online fileystem check /dev/olympic/olympic-home
Oct  4 02:52:00 olympic online-check: .NO: Volume /dev/olympic/olympic-home checked OK
Oct  4 02:52:01 olympic online-check: .IN: Snapshot /dev/olympic/olympic-home-snap used 0.10%
Comment 17 Theodore Tso 2015-10-11 04:05:31 UTC
I just realized something which I forgot to ask you --- what file system are you using on the *host*?  You said that on the guest you are using ext3 file systems with the ext4 driver --- but what about on the host OS side?

I will say that using a full ext4 file system is far more stable on 3.18.21 than using an ext3 file system in compatibility mode.   Ext4 with the 4.2 kernel will support ext3 with zero unexpected test failures, but there are a handful of test failures that are showing up with the 3.18.21 kernel.  I've backported a few bug fixes, but they haven't shown up in the stable kernel series yet, and there are still a half-dozen test failures that I haven't had time to characterized yet.  (Note: support for long-term stable kernel is something I do as a very low priority background task.  My priority is upstream regressions and upstream development, and ideally someone would help with testing the stable kernels and identifying patches that require manual backports, but I haven't found a sucker yet.)

One of the reasons why I ask is that the PUNCH hole functionality was relatively new in the 3.18 kernel, and KVM uses it --- and I am suspicious that there might be some bug fixes that didn't land in the 3.18 kernel.   So one thing that might be worth trying is to get a 4.2.3 kernel for both your Host and Guest kernels, and see what that does for you.
Comment 18 John Hughes 2015-10-12 10:36:00 UTC
> I just realized something which I forgot to ask you --- what file system are
> you using on the *host*?  You said that on the guest you are using ext3 file
> systems with the ext4 driver --- but what about on the host OS side?

The volumes passed to the guest are just mdadm volumes, no fs from the 
point of view of the host.

The host filesystems are all ext3 (on lvm, on mdadm) , the host runs the 
debian 3.16 backport kernel, but it does very little filesystem I/O.

> I will say that using a full ext4 file system is far more stable on 3.18.21
> than using an ext3 file system in compatibility mode.
What's the magic command to convert an ext3 fs to ext4?  The wisdom of 
net seems to say:

tune2fs -O extents,uninit_bg,dir_index /dev/xxx
fsck.ext4 -yfD /dev/xxx

Or would it be better to make a new FS and copy the files?
> One of the reasons why I ask is that the PUNCH hole functionality was
> relatively new in the 3.18 kernel, and KVM uses it --- and I am suspicious
> that
> there might be some bug fixes that didn't land in the 3.18 kernel.
Was it used in 3.16?  I first saw the problem with debians 3.16 based 
kernel.

>     So one
> thing that might be worth trying is to get a 4.2.3 kernel for both your Host
> and Guest kernels, and see what that does for you.

Ok, I'll investigate that.
Comment 19 cwseys 2015-10-12 14:01:45 UTC
Might be interesting to convert the filesystem to XFS via a series of partition 
adds / copies / removes .  At least then the ext4 connection could be more 
firmly established.

(Actually, might want to "convert" to ext4 instead of XFS first and verify that 
the problem does not go away.)

Chad.
Comment 20 John Hughes 2015-10-15 15:32:16 UTC
So yesterday it happened again:

Oct 14 18:08:48 olympic kernel: [605450.518092] EXT4-fs error (device dm-2): ext4_mb_generate_buddy:757: group 434, block bitmap and bg descriptor inconsistent: 1834 vs 1833 free clusters
Oct 14 18:08:48 olympic kernel: [605450.522151] Aborting journal on device dm-2-8.
Oct 14 18:08:48 olympic kernel: [605450.524218] EXT4-fs (dm-2): Remounting filesystem read-only
Oct 14 18:08:48 olympic kernel: [605450.572244] EXT4-fs error (device dm-2) in ext4_reserve_inode_write:4775: Journal has aborted
Oct 14 18:08:48 olympic kernel: [605450.577125] EXT4-fs error (device dm-2) in ext4_orphan_del:2688: Journal has aborted
Oct 14 18:08:48 olympic kernel: [605450.583858] EXT4-fs error (device dm-2) in ext4_reserve_inode_write:4775: Journal has aborted
Comment 21 John Hughes 2015-10-15 15:38:17 UTC
(In reply to cwseys from comment #19)
> Might be interesting to convert the filesystem to XFS via a series of
> partition adds / copies / removes .  At least then the ext4 connection
> could be more firmly established.

Well, since I've demirrored the system (ugh) I have enough space to make a copies of my filesystmes.

> (Actually, might want to "convert" to ext4 instead of XFS first and verify
> that the problem does not go away.)

I'm planning to move to ext4 over the weekend.  I've not yet decided whether to do it in-place or by copying the data to a new filesystem.

Will doing an in place conversion be enough?
Comment 22 John Hughes 2015-10-15 15:41:05 UTC
But on reflection maybe it's better to move to 4.2 as Theodore says:

"Ext4 with the 4.2 kernel will support ext3 with zero unexpected test failures, but there are a handful of test failures that are showing up with the 3.18.21 kernel."

Decisions, decisions.
Comment 23 John Hughes 2015-10-16 13:04:10 UTC
So I've booted the guest with kernel 4.3.2:

Linux olympic 4.2.3-jh1 #2 SMP Thu Oct 15 19:04:09 CEST 2015 x86_64 GNU/Linux

I'm leaving the host on debians 3.16.5-1~bpo70+1 kernel for the moment.

If I see the problem again I'll move the host to 4.2.3.

If I see the problem with host and guest at 4.2.3 I'll convert the filesystems (in place) to ext4.

If I see the problems with in-place converted ext4 I'll rebuild the filesysems as native ext4 and copy all the files over.

Seems like a good plan?
Comment 24 cwseys 2015-10-16 15:53:27 UTC
Hi John,
	Thanks for working on this!
	I've attempted to send debian infrastructure an email asking if they've 
noticed this problem.  If not, I'd like to find the difference.  (Maybe they 
don't use ext4 in VMs.)

> If I see the problem with host and guest at 4.2.3 I'll convert the
> filesystems (in place) to ext4.
> 
> If I see the problems with in-place converted ext4 I'll rebuild the
> filesysems as native ext4 and copy all the files over.

	I am using  formatted as ext4 "native ext4" and seem to see the same 
problem.  But will be interesting to hear your results over a couple months . 
:)

Chad.
Comment 25 cwseys 2015-10-16 16:14:24 UTC
P.S.
There is a bug report for Ubuntu VMs going read-only has similar backtraces. 

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1423672

There are some with "ext4_mb_generate_buddy" (NOT what John originally 
reported) and some with "ext4_mb_release_inode_pa" (John's original bug 
report).  They seem to be mutually exclusive - if one appears the other does 
not.

The ones with "ext4_mb_release_inode_pa" show up in kernel 3.13.? (Ubuntu's 
3.13.0-54-generic).

Theo said of backtraces containing "ext4_mb_generate_buddy" - "I believe the 
bug in question was fixed in a backport that first showed up in v3.16.2"

C.
Comment 26 John Hughes 2015-10-20 13:40:55 UTC
Well I've just had the problem happen when running the 4.2.3 kernel (on the guest, the host is still 3.16).

This time it was on the root filesystem, so no log messages unfortunately.

Will try and schedule some time to move the host to 4.2.3 this week.
Comment 27 John Hughes 2015-10-20 15:44:59 UTC
And now I have had (with 4.2.3) an error on my /home filesystem:

Oct 20 16:06:58 olympic kernel: [ 8027.714881] EXT4-fs error (device dm-2): ext4_mb_generate_buddy:758: group 519, block bitmap and bg descriptor inconsistent: 22589 vs 22588 free clusters
Oct 20 16:06:58 olympic kernel: [ 8027.718806] Aborting journal on device dm-2-8.
Oct 20 16:06:58 olympic kernel: [ 8027.722387] EXT4-fs (dm-2): Remounting filesystem read-only
Oct 20 16:06:58 olympic kernel: [ 8027.724291] EXT4-fs error (device dm-2) in ext4_free_blocks:4889: Journal has aborted
Oct 20 16:06:58 olympic kernel: [ 8027.727369] EXT4-fs error (device dm-2) in ext4_do_update_inode:4504: Journal has aborted
Oct 20 16:06:58 olympic kernel: [ 8027.735809] EXT4-fs error (device dm-2) in ext4_truncate:3789: IO failure
Oct 20 16:06:58 olympic kernel: [ 8027.739795] EXT4-fs error (device dm-2) in ext4_orphan_del:2895: Journal has aborted
Oct 20 16:06:58 olympic kernel: [ 8027.743834] EXT4-fs error (device dm-2) in ext4_do_update_inode:4504: Journal has aborted

Note that this is the ext4_mb_generate_buddy error cwseys mentioned.  AARGH!

When I ran fsck only one error was detected: "block bitmap differences +17016320"

As the system was down I've moved the host to 4.2.3 (the users were already screaming, so...)
Comment 28 John Hughes 2015-10-20 15:55:32 UTC
I've just noticed that the system had pretty low memory -- only 1G.  I've upped it to 2G.  Don't know whether that will make a difference.  I see some allocation failures, mostly relating to apache and tcp.

Oct 17 05:26:10 olympic kernel: [53046.469494] apache2: page allocation failure: order:1, mode:0x204020
Oct 17 05:26:10 olympic kernel: [53046.469812] apache2: page allocation failure: order:1, mode:0x204020
Oct 17 05:26:10 olympic kernel: [53046.469985] apache2: page allocation failure: order:1, mode:0x204020
Oct 17 05:26:10 olympic kernel: [53046.470154] apache2: page allocation failure: order:1, mode:0x204020
Oct 19 22:43:46 olympic kernel: [288101.737120] apache2: page allocation failure: order:1, mode:0x204020
Oct 19 22:43:46 olympic kernel: [288101.737833] kswapd0: page allocation failure: order:1, mode:0x204020
Oct 20 06:12:46 olympic kernel: [315042.431636] apache2: page allocation failure: order:1, mode:0x204020
Oct 20 06:12:46 olympic kernel: [315042.431952] apache2: page allocation failure: order:1, mode:0x204020
Oct 20 06:12:46 olympic kernel: [315042.631774] swapper/0: page allocation failure: order:1, mode:0x204020
Oct 20 06:12:46 olympic kernel: [315042.635743] swapper/0: page allocation failure: order:1, mode:0x204020
Oct 20 07:03:32 olympic kernel: [318088.158197] apache2: page allocation failure: order:1, mode:0x204020
Oct 20 07:03:32 olympic kernel: [318088.160930] apache2: page allocation failure: order:1, mode:0x204020
Oct 20 07:03:32 olympic kernel: [318088.163835] apache2: page allocation failure: order:1, mode:0x204020
Oct 20 10:44:54 olympic kernel: [331369.850853] apache2: page allocation failure: order:1, mode:0x204020
Oct 20 10:44:54 olympic kernel: [331369.851186] apache2: page allocation failure: order:1, mode:0x204020
Oct 20 10:44:54 olympic kernel: [331369.851357] apache2: page allocation failure: order:1, mode:0x204020
Oct 20 10:54:04 olympic kernel: [331920.651891] apache2: page allocation failure: order:1, mode:0x204020
Comment 29 cwseys 2015-10-20 16:28:08 UTC
If what we're seeing is the same thing...

8 of 10 VMs with <=1GB which have gone read-only.
0 of 2 VMs with >1GB where the problem has never occurred.

(This is an estimate:  May have missed some which have not gone read-only.)

So let's see what happens!
C.
Comment 30 cwseys 2015-10-20 16:30:46 UTC
But, FWIW awhile back I had a test VM with 1GB RAM and ran 'stress' on the VM 
for a couple of weeks with no problems.

C.
Comment 31 John Hughes 2015-11-25 10:09:44 UTC
Ok, its now been 35 days since the last problem.

I'm currently running:

Host: 4.2.3
Guest: 4.2.3

Unfortunately when I moved the host from 3.16 to 4.2.3 I also changed the amount of memory available for the guest, poor style for debugging but I was getting a lot of flack from disgruntled users.

I will now try moving the guest back to 3.16 with 1G of ram to see whether the problem was solved by changing the host kernel or by changing the guest memory size.

More news as it becomes available.
Comment 32 John Hughes 2016-01-19 12:00:10 UTC
Well I've (finally) rebooted with guest kernel 3.16 and 1Go memory.  It took a long time to convince the users to risk a (potentially) unstable system after a long time of all working OK.

For memory -- the problem went away when I moved the host kernel to 4.2.3 and increased the memory of the guest system to 2Go.  Changing the guest kernel from 4.2.3 to 3.16 *did not* make the problem go away.

Possible results --

1. problem does not show up, implies the fix is moving the host kernel to 4.2.3

2. problem does show up, implies the fix is either:

   a. increasing the guest memory to 2Go, or

   b. moving the host to 4.2.3 *and* increasing the guest memory to 2Go

   c. moving the guest to 4.2.3 *and* increasing the guest memory to 2Go

   d. moving the host to 4.2.3 *and* moving the guest to 4.2.3

   e. moving the host to 4.2.3 *and* moving the guest to 4.2.3 *and* increasing
      the guest memory to 2Go.

More news as it happens.
Comment 33 cwseys 2016-01-21 23:57:59 UTC
Hello,
	Could you summarize your findings?  It is hard to keep track after so much 
time.  :)
	Here is a beginning:

Hversion    Gversion  Gmem  read-only?   uptime
3.16           3.16         ????       yes               ~14 days
4.2.3          4.2.3        2 GB      no                ~60 days
4.2.3          3.16         ? GB      inprogress


Thanks!
Chad.
Comment 34 John Hughes 2016-01-22 10:27:24 UTC
I first started seeing the problem when I moved the guest kernel to 3.16 
(i.e. started using the ext4 driver for my ext3 fileystems).

1. Host 3.16, Guest 3.16, 1G memory, filesystem goes readonly after 
around 4 days

Then I upgraded the guest to 3.18.19

2. Host 3.16 Guest 3.18.19, 1G memory, filesystem goes readonly after 
around 20 days

Then I upgraded the guest to 4.3.2

3. Host 3.16 Guest 4.3.2, 1G memory, filesystem goes readonly after 4 days

Then I upgraded the host to 4.3.2, noticed the guest had only 1G of 
memory (and the guest kernel was bitching about not having enough memory 
during some TCP operations) so I increased the guest memory to 2G,  Bad 
debugging style, but I was desperate.

4. Host 4.3.2, Guest 4.3.2, 2G memory, system ran for 90 days with no 
problem.

So now I've set the guest to 3.16 with 1G of memory.  We'll see what 
happens next.
Comment 35 cwseys 2016-01-22 15:20:01 UTC
Thanks!
I think you mistyped 4.2.3 -> 4.3.2 .

Here's the table (helps me see any pattern b/c it is closer together.  Less 
memory required. :) )

Hversion    Gversion  Gmem  time before read-only
3.16        3.16      1 GB  ~4 days
3.16        3.18.19   1 GB  ~20 days
3.16        4.2.3     1 GB  4 days
4.2.3       4.2.3     2 GB  not after ~90 days
4.2.3       3.16      1 GB  inprogress


> Then I upgraded the host to 4.3.2, noticed the guest had only 1G of
> memory (and the guest kernel was bitching about not having enough memory
> during some TCP operations) so I increased the guest memory to 2G,  Bad
> debugging style, but I was desperate.

Debugging and production don't get along well together. :)

> So now I've set the guest to 3.16 with 1G of memory.  We'll see what
> happens next.

I haven't tried changing the host's kernel version yet. Will be interested to 
know how it turns out!  Thanks for debugging and hopefully you'll help us 
(Debian/Ubuntu users) find a workaround at least.

Thanks again,
Chad.
Comment 36 John Hughes 2016-01-22 16:36:39 UTC
Created attachment 200981 [details]
attachment-20424-0.html

On 22/01/16 16:20, bugzilla-daemon@bugzilla.kernel.org wrote:
> I think you mistyped 4.2.3 -> 4.3.2 .
Duh, yes.

# uname -a
Linux baltic*4.2.3*-jh1 #2 SMP Thu Oct 15 19:04:09 CEST 2015 x86_64 GNU/Linux
Comment 37 andre.arnold 2016-02-08 09:52:14 UTC
Hey all together,

thanks for the information provided, it helped me a lot fixing my VM crashes in KVM.

After downgrading the Gversion from 3.16 the crashes were gone.

Chad, whats your experience with your latest test? Running Hversion 4.3.2 and
Gversion on 3.16?

Thanks in advance.

Cheers
André
Comment 38 John Hughes 2016-02-08 10:56:09 UTC
My current status is:

Host: 4.2.3
Guest: 3.16.7

Guest uptime: 20 days.

I'm going to leave it like this for some more time, but my current 
feeling is that the ext4 subsystem in a KVM quest is only stable if the 
host kernel is > 4.something.  The ext3 subsystem did not seem to be 
vunerable to whatever the host was doing wrong before 4.whatever.
Comment 39 Vaclav Ovsik 2016-03-18 22:20:08 UTC
Hello,
please take a look on
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=818502
IMO the problem is in the hypervisor in 3.16 and older Intel CPU.
Currently I'm testing hypervisor 4.1 and it seems to be OK.
-- 
Zito
Comment 40 Vaclav Ovsik 2016-03-19 17:49:56 UTC
Issue attributes should be changed. This is not about filesystem in VM. The colleague had fail with btrfs in VM too...


-------- Přeposlaná zpráva --------
Předmět: 	[logcheck] ttrss.home.nahorany.net 2016-03-11 15:00 +0100 Security Events
Datum: 	Fri, 11 Mar 2016 15:00:16 +0100 (CET)
Od: 	logcheck system account <logcheck@home.nahorany.net>
Komu: 	centrino@nahorany.net



Security Events for kernel
=-=-=-=-=-=-=-=-=-=-=-=-=-
Mar 11 14:41:06 ttrss kernel: [2816393.980768] end_request: I/O error, dev vda, sector 2195232
Mar 11 14:41:06 ttrss kernel: [2816393.980768] end_request: I/O error, dev vda, sector 2198280
Mar 11 14:41:06 ttrss kernel: [2816393.980768] end_request: I/O error, dev vda, sector 2973336
Mar 11 14:41:06 ttrss kernel: [2816393.980768] end_request: I/O error, dev vda, sector 3351896
Mar 11 14:41:06 ttrss kernel: [2816393.980768] end_request: I/O error, dev vda, sector 2198280
Mar 11 14:41:06 ttrss kernel: [2816393.980768] end_request: I/O error, dev vda, sector 2973336
Mar 11 14:41:06 ttrss kernel: [2816393.980768] end_request: I/O error, dev vda, sector 3351896

System Events
=-=-=-=-=-=-=
Mar 11 14:41:36 ttrss sshd[23130]: Received disconnect from 172.16.0.22: 11: disconnected by user
Mar 11 14:41:06 ttrss kernel: [2816393.980768] end_request: I/O error, dev vda, sector 2195232
Mar 11 14:23:32 ttrss ansible-command: Invoked with warn=True executable=None chdir=None _raw_params=apt-get update removes=None creates=None _uses_shell=False
Mar 11 14:25:14 ttrss ansible-command: Invoked with warn=True executable=None chdir=None _raw_params=apt-get upgrade -y removes=None creates=None _uses_shell=False
Mar 11 14:39:19 ttrss kernel: [2816286.989244] device-mapper: uevent: version 1.0.3
Mar 11 14:39:19 ttrss kernel: [2816287.638040] device-mapper: ioctl: 4.27.0-ioctl (2013-10-30) initialised: dm-devel@redhat.com
Mar 11 14:39:19 ttrss kernel: [2816286.989244] device-mapper: uevent: version 1.0.3
Mar 11 14:39:19 ttrss kernel: [2816287.638040] device-mapper: ioctl: 4.27.0-ioctl (2013-10-30) initialised: dm-devel@redhat.com
Mar 11 14:39:26 ttrss kernel: [2816295.076906] SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled
Mar 11 14:39:26 ttrss kernel: [2816295.076906] SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled
Mar 11 14:39:27 ttrss kernel: [2816295.173689] JFS: nTxBlock = 1938, nTxLock = 15510
Mar 11 14:39:27 ttrss kernel: [2816295.414255] ntfs: driver 2.1.30 [Flags: R/W MODULE].
Mar 11 14:39:27 ttrss kernel: [2816295.721879] QNX4 filesystem 0.2.3 registered.
Mar 11 14:39:27 ttrss kernel: [2816295.173689] JFS: nTxBlock = 1938, nTxLock = 15510
Mar 11 14:39:27 ttrss kernel: [2816295.414255] ntfs: driver 2.1.30 [Flags: R/W MODULE].
Mar 11 14:39:27 ttrss kernel: [2816295.721879] QNX4 filesystem 0.2.3 registered.
Mar 11 14:39:28 ttrss kernel: [2816296.129309] fuse init (API version 7.23)
Mar 11 14:39:28 ttrss kernel: [2816296.129309] fuse init (API version 7.23)
Mar 11 14:39:28 ttrss os-prober: debug: /dev/vda1: is active swap
Mar 11 14:39:29 ttrss ansible-command: Invoked with warn=True executable=None chdir=None _raw_params=apt-get -y upgrade removes=None creates=None _uses_shell=False
Mar 11 14:41:06 ttrss kernel: [2816393.980768] BTRFS: bdev /dev/vda2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Mar 11 14:41:06 ttrss kernel: [2816393.980768] BTRFS: bdev /dev/vda2 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Mar 11 14:41:06 ttrss kernel: [2816393.980768] BTRFS: bdev /dev/vda2 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
Mar 11 14:41:06 ttrss kernel: [2816393.980768] BTRFS: bdev /dev/vda2 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
Mar 11 14:41:06 ttrss kernel: [2816393.980768] BTRFS: bdev /dev/vda2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Mar 11 14:41:06 ttrss kernel: [2816393.980768] BTRFS: bdev /dev/vda2 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Mar 11 14:41:06 ttrss kernel: [2816393.980768] BTRFS: bdev /dev/vda2 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
Mar 11 14:41:06 ttrss kernel: [2816393.980768] BTRFS: bdev /dev/vda2 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
Mar 11 14:45:30 ttrss ansible-command: Invoked with warn=True executable=None chdir=None _raw_params=apt-get -y upgrade removes=None creates=None _uses_shell=False

-- 
Zito
Comment 41 Theodore Tso 2016-03-20 01:27:58 UTC
I'm not sure we need to leave this bug open at all.  It sounds like it's some combination of a buggy Intel CPU and a buggy kernel code in the KVM/virtualization support.   The fix seems to be to update to a newer host kernel (4.1 or newer).

This is problematic for Debian users, but at this point it's a bug for the Debian kernel maintainers, and this bugzilla is for the upstream kernel....
Comment 42 Vaclav Ovsik 2016-03-20 23:26:49 UTC
Yes, you are right. I found a commit https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?h=7dec5603b6b8dc4c3e1c65d318bd2a5a8c62a424 using git bisect. I hope this commit can be used for Debian kernel package.
Thanks
Comment 43 Theodore Tso 2016-03-21 13:04:06 UTC
Vaclav, I want to really, really thank you for doing the bisect.   Looking at the testing procedure you used for the git bisect here:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=818502

I'm quite confident you've found the commit that needs to be backported into the Debian 3.16 kernel.  Given that the commit was added in 3.17, it should be fairly easy to backport it into the Debian kernel.

It's really too bad this commit hadn't been marked with a cc:stable@vger.kernel.org tag, but to be fair I've sometimes made this mistake, either out of sheer forgetfulness or because I didn't recognize the seriousness of the bugs that a commit could address.   To be fair to the KVM maintainers, the commit description doesn't really list the potential impacts of the bug it was fixing.

Also with 20-20 hindsight, it's perhaps unfortunate that during this time period there is a real divergence[1] of kernels that were used by distributions.  So a bug that would only show up on certain generations of Intel chipsets, and only when used in virtualization as a host, is precisely the sort of bug that it is not likely to be noticed until it goes into enterprise-wide deployment --- and so the fact that other distributions didn't standardize on a single kernel in this time period (and Debian standardized on the oldest kernel in this rough time period, 3.16, and the bug in question was fixed in 3.17) meant that it's not all that surprising that this slipped through.   And while it would have been more convenient if Debian had been willing to switch over to a 3.18 based stable series, it wasn't compatible with their release schedule.

[1] Ubuntu 14.04 LTS stayed on 3.16 for only 18 months before moving to 3.19;  Fedora 20 was on 3.11, and Fedora 21 jumped to 3.17, etc.

Vaclav, thanks again for finding a simple/easy repro, and then bisecting until you found the commit that is needed for the 3.16 debian kernel!
Comment 44 cwseys 2016-03-25 16:55:03 UTC
Any ideas why ext4 in guests with kernel version 3.2.x (Wheezy) did not go 
read-only, but guests with 3.16.x (Jessie) did?

I suppose running a vmhost with kernel 3.16 and bisecting the guest's kernel 
from 3.2 to 3.16 would give us this answer.

But not sure how interesting it would be given that the root cause is KVM not 
simulating real machine correctly.

C.
Comment 45 John Hughes 2016-04-08 15:49:34 UTC
Sorry for my lack of reactivity, I've been a bit ill (had a terrible cough, actually).

I've been running with a 4.2.3 kernel in the host since 19/1/2016 (around 80 days) with no problems.

Yesterday my host was rebooted (power cut) and ran with the 3.16 kernel for one day and I saw the problem again.

Based on that, and Vaclav's superb research I am convinced that the problem is in the host system, probably fixed by the patch he found.

Note You need to log in before you can comment on or make changes to this bug.