Bug 206031 - btrfs send/receive out of memory with 5.5.0-rc1; works with <=5.4
Summary: btrfs send/receive out of memory with 5.5.0-rc1; works with <=5.4
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: btrfs (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: BTRFS virtual assignee
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-12-31 15:37 UTC by Craig Andrews
Modified: 2022-10-06 22:00 UTC (History)
5 users (show)

See Also:
Kernel Version: 5.5.0-rc4
Subsystem:
Regression: No
Bisected commit-id:


Attachments
kernel logs (301.42 KB, text/plain)
2019-12-31 15:40 UTC, Craig Andrews
Details
config.gz (28.54 KB, application/gzip)
2019-12-31 15:40 UTC, Craig Andrews
Details
.config for kernel 5.5.6 (23.48 KB, application/gzip)
2020-02-29 08:55 UTC, Mark Nowiasz
Details

Description Craig Andrews 2019-12-31 15:37:00 UTC
If I perform a btrfs send receive like so:

sh -c btrfs send -p /mnt/everything/.snapshots/root.20191230 /mnt/everything/.snapshots/root.20191231 | btrfs receive /mnt/backup/.snapshots/

On Linux 5.4.0, the process completes successfully.

Starting with Linux 5.5.0-rc1 up to the current 5.5 rc, 5.5.0-rc4, he result is the OOM killer being invoked which (among other process carnage) kills the btrfs processes stopping the backup.

I'm using the same kernel config, same hardware, etc in the two tests. The system has 16 GB of RAM, CPU is an i5-6500, arch is amd64.
/mnt/everything is a btrfs fs on luks (/dev/mapper/sda4) on a 1 TB SATA SSD (/dev/sda4). /mnt/backup is btfs fs on luks (/dev/mapper/backup) on a 2 TB external spinning rust HDD (/dev/sdb).

# df -H /mnt/everything/ /mnt/backup
Filesystem          Size  Used Avail Use% Mounted on
/dev/mapper/sda4    949G  698G  247G  74% /mnt/everything
/dev/mapper/backup  2.1T  1.1T  948G  53% /mnt/backup

I'll attach some logs and my kernel config.
Comment 1 Craig Andrews 2019-12-31 15:40:02 UTC
Created attachment 286545 [details]
kernel logs

The backup is started by cron at Dec 31 03:09.

It looks like things start going sideways at Dec 31 04:08:51:
Dec 31 04:08:51 irrational kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Dec 31 04:08:51 irrational kernel: rcu:         1-....: (17999 ticks this GP) idle=51a/1/0x4000000000000002 softirq=2375493/2375493 fqs=5999 
Dec 31 04:08:51 irrational kernel:         (t=18000 jiffies g=2154845 q=641956)
Dec 31 04:08:51 irrational kernel: NMI backtrace for cpu 1
Dec 31 04:08:51 irrational kernel: CPU: 1 PID: 16900 Comm: btrfs Not tainted 5.5.0-rc4 #2
Dec 31 04:08:51 irrational kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z170 Gaming-ITX/ac, BIOS P7.30 01/23/2018
Dec 31 04:08:51 irrational kernel: Call Trace:
Dec 31 04:08:51 irrational kernel:  <IRQ>
Dec 31 04:08:51 irrational kernel:  dump_stack+0x50/0x68
Dec 31 04:08:51 irrational kernel:  nmi_cpu_backtrace.cold+0x14/0x53
Dec 31 04:08:51 irrational kernel:  ? lapic_can_unplug_cpu.cold+0x3e/0x3e
Dec 31 04:08:51 irrational kernel:  nmi_trigger_cpumask_backtrace+0x8a/0x97
Dec 31 04:08:51 irrational kernel:  rcu_dump_cpu_stacks+0x7b/0xa9
Dec 31 04:08:51 irrational kernel:  rcu_sched_clock_irq.cold+0x153/0x38a
Dec 31 04:08:51 irrational kernel:  ? __cgroup_account_cputime_field+0x3a/0x50
Dec 31 04:08:51 irrational kernel:  update_process_times+0x1f/0x50
Dec 31 04:08:51 irrational kernel:  tick_sched_timer+0x33/0x70
Dec 31 04:08:51 irrational kernel:  ? tick_sched_do_timer+0x60/0x60
Dec 31 04:08:51 irrational kernel:  __hrtimer_run_queues+0xec/0x190
Dec 31 04:08:51 irrational kernel:  hrtimer_interrupt+0x10c/0x240
Dec 31 04:08:51 irrational kernel:  smp_apic_timer_interrupt+0x61/0xa0
Dec 31 04:08:51 irrational kernel:  apic_timer_interrupt+0xf/0x20
Dec 31 04:08:51 irrational kernel:  </IRQ>
Comment 2 Craig Andrews 2019-12-31 15:40:47 UTC
Created attachment 286547 [details]
config.gz
Comment 3 Craig Andrews 2020-01-12 13:57:58 UTC
I re-tested with 5.5.0-rc5 and the issue still exists.
Comment 4 Craig Andrews 2020-01-28 14:45:06 UTC
I applied that patch from https://patchwork.kernel.org/patch/11350129/ to 5.5.0-rc7 and that solved the problem. I can now do backups (which is a send/receive) successfully.
Comment 5 David Sterba 2020-02-28 18:55:48 UTC
Thanks for the report and testing. Fixed in stable 5.5.3, closing.
Comment 6 Mark Nowiasz 2020-02-29 08:22:22 UTC
(In reply to David Sterba from comment #5)
> Thanks for the report and testing. Fixed in stable 5.5.3, closing.

Excuse me, but I *strongly* disagree.

I've got *exactly* the same problem, and tried every kernel in 5.5.x and the problem still exists - at least in 5.5.6:

[14608.466441] rcu: INFO: rcu_sched self-detected stall on CPU
[14608.466820] rcu:     0-....: (1 GPs behind) idle=cb2/1/0x4000000000000002 softirq=1223126/1223127 fqs=5247 
[14608.467232]  (t=21001 jiffies g=2521121 q=630)
[14608.467483] NMI backtrace for cpu 0
[14608.467729] CPU: 0 PID: 24168 Comm: btrfs Not tainted 5.5.1-gentoo #1
[14608.468036] Hardware name: netcup KVM Server, BIOS RS 1000 SAS G8 09/27/2018
[14608.468370] Call Trace:
[14608.468575]  <IRQ>
[14608.468778]  dump_stack+0x50/0x70
[14608.468991]  nmi_cpu_backtrace.cold+0x14/0x53
[14608.469271]  ? lapic_can_unplug_cpu.cold+0x43/0x43
[14608.470013]  nmi_trigger_cpumask_backtrace+0x80/0x8d
[14608.470801]  rcu_dump_cpu_stacks+0x7b/0xa9
[14608.471519]  rcu_sched_clock_irq.cold+0x153/0x395
[14608.472248]  ? tick_sched_do_timer+0x50/0x50
[14608.472959]  ? tick_sched_do_timer+0x50/0x50
[14608.473670]  update_process_times+0x1f/0x50
[14608.474478]  tick_sched_timer+0x33/0x70
[14608.475202]  __hrtimer_run_queues+0xdf/0x170
[14608.475915]  hrtimer_interrupt+0x101/0x220
[14608.476627]  smp_apic_timer_interrupt+0x48/0x80
[14608.477356]  apic_timer_interrupt+0xf/0x20
[14608.478062]  </IRQ>
[14608.478751] RIP: 0010:do_splice+0x10d/0x5e0
[14608.479468] Code: 8b 4f 48 8b 75 3c 45 8b 6f 38 44 8d 58 ff 41 8b 47 44 85 c9 8d 78 ff 0f 84 72 02 00 00 39 75 38 41 8b 47 3c 0f 84 9e 02 00 00 <44> 89 ea 29 c2 41 39 57 40 0f 86 9a 02 00 00 44 89 d8 21 f0 48 8d
[14608.481616] RSP: 0018:ffff88821699be90 EFLAGS: 00000202 ORIG_RAX: fffffffffff

(5.5.6)

So the problem hasn't been resolved in 5.5.3+  - I strongly request reopening this bug, I'm currently unable to do any btrfs backups (using btrfs send) on a 5.5.x (up to 5.5.6, haven't tested 5.5.7 yet because I don't see the point) kernel
Comment 7 Mark Nowiasz 2020-02-29 08:55:36 UTC
Created attachment 287709 [details]
.config for kernel 5.5.6

.config for kernel 5.5.6, still crashes when starting btrfs-send (via btrbk)
Comment 8 Mark Nowiasz 2020-02-29 10:12:30 UTC
Sorry, wrong dmesg (the previous was for 5.5.1).

5.5.6:

[45151.430935] Call Trace:
[45151.431702]  <IRQ>
[45151.432401]  dump_stack+0x50/0x70
[45151.433095]  nmi_cpu_backtrace.cold+0x14/0x53
[45151.433823]  ? lapic_can_unplug_cpu.cold+0x43/0x43
[45151.434545]  nmi_trigger_cpumask_backtrace+0x80/0x8d
[45151.435316]  rcu_dump_cpu_stacks+0x7b/0xa9
[45151.436024]  rcu_sched_clock_irq.cold+0x153/0x395
[45151.436768]  ? tick_sched_do_timer+0x50/0x50
[45151.437471]  ? tick_sched_do_timer+0x50/0x50
[45151.438182]  update_process_times+0x1f/0x50
[45151.438899]  tick_sched_timer+0x33/0x70
[45151.439673]  __hrtimer_run_queues+0xdf/0x170
[45151.440446]  hrtimer_interrupt+0x101/0x220
[45151.441383]  smp_apic_timer_interrupt+0x48/0x80
[45151.442340]  apic_timer_interrupt+0xf/0x20
[45151.443261]  </IRQ>
[45151.444106] RIP: 0010:mutex_unlock+0x13/0x20
[45151.445039] Code: 43 10 48 89 e7 48 8b 70 10 e8 d9 43 82 ff eb ae 0f 1f 80 00 00 00 00 65 48 8b 14 25 00 7f 01 00 31 c9 48 89 d0 f0 48 0f b1 0f <48> 39 c2 74 05 e9 f3 fe ff ff c3 66 90 48 8b 47 20 48 85 c0 74 0f
[45151.447854] RSP: 0018:ffff8881ccd9fe88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[45151.449560] RAX: ffff888236e713c0 RBX: 0000000000000000 RCX: 0000000000000000
[45151.451255] RDX: ffff888236e713c0 RSI: 0000000000002729 RDI: ffff8881ccc73300
[45151.452933] RBP: ffff8881f1da5cc0 R08: 0000000000010000 R09: 0000000000000004
[45151.454628] R10: ffff8881fc1bd400 R11: 000000000000000f R12: 0000000000010000
[45151.456323] R13: 0000000000002729 R14: 0000000000000001 R15: ffff8881ccc73300
[45151.459938]  do_splice+0x3da/0x5e0
[45151.460848]  __x64_sys_splice+0xed/0x110
[45151.461777]  do_syscall_64+0x3d/0x100
[45151.462798]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[45151.464035] RIP: 0033:0x7f0c1903585e
[45151.464948] Code: e8 67 fd f7 ff 89 c5 44 8b 4c 24 2c 4c 8b 44 24 20 4c 8b 54 24 18 8b 54 24 28 48 8b 74 24 10 8b 7c 24 08 b8 13 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 32 89 ef 48 89 44 24 08 e8 8e fd f7 ff 48 8b
[45151.467761] RSP: 002b:00007f0c18f28d10 EFLAGS: 00000293 ORIG_RAX: 0000000000000113
[45151.469514] RAX: ffffffffffffffda RBX: 00007ffefc87a4d0 RCX: 00007f0c1903585e
[45151.471315] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000005
[45151.472997] RBP: 0000000000000000 R08: 0000000000010000 R09: 0000000000000004
[45151.474683] R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffefc87a3ce
[45151.476374] R13: 00007ffefc87a3cf R14: 00007f0c18f29700 R15: 00007f0c18f28e40
[45214.425604] rcu: INFO: rcu_sched self-detected stall on CPU
[45214.470336] rcu:     1-....: (1 GPs behind) idle=06e/1/0x4000000000000002 softirq=23549806/23549807 fqs=36142                                                
[45214.471708]  (t=147006 jiffies g=17081825 q=238443)
[45214.472451] NMI backtrace for cpu 1
[45214.473192] CPU: 1 PID: 53688 Comm: btrfs Not tainted 5.5.6-gentoo #1
[45214.474015] Hardware name: netcup KVM Server, BIOS RS 1000 SAS G8 09/27/2018
[45214.486203] Call Trace:
[45214.486918]  <IRQ>
[45214.487669]  dump_stack+0x50/0x70
[45214.488367]  nmi_cpu_backtrace.cold+0x14/0x53
[45214.489172]  ? lapic_can_unplug_cpu.cold+0x43/0x43
[45214.489909]  nmi_trigger_cpumask_backtrace+0x80/0x8d
[45214.490671]  rcu_dump_cpu_stacks+0x7b/0xa9
[45214.491399]  rcu_sched_clock_irq.cold+0x153/0x395
[45214.492221]  ? tick_sched_do_timer+0x50/0x50
[45214.492985]  ? tick_sched_do_timer+0x50/0x50
[45214.493694]  update_process_times+0x1f/0x50
[45214.494405]  tick_sched_timer+0x33/0x70
[45214.495102]  __hrtimer_run_queues+0xdf/0x170
[45214.498680]  hrtimer_interrupt+0x101/0x220
[45214.499392]  smp_apic_timer_interrupt+0x48/0x80
[45214.500117]  apic_timer_interrupt+0xf/0x20
[45214.501033]  </IRQ>
[45214.501669] RIP: 0010:mutex_unlock+0x13/0x20
[45214.502484] Code: 43 10 48 89 e7 48 8b 70 10 e8 d9 43 82 ff eb ae 0f 1f 80 00 00 00 00 65 48 8b 14 25 00 7f 01 00 31 c9 48 89 d0 f0 48 0f b1 0f <48> 39 c2 74 05 e9 f3 fe ff ff c3 66 90 48 8b 47 20 48 85 c0 74 0f
[45214.504766] RSP: 0018:ffff8881ccd9fe88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[45214.506054] RAX: ffff888236e713c0 RBX: 0000000000000000 RCX: 0000000000000000
[45214.507323] RDX: ffff888236e713c0 RSI: 0000000000002729 RDI: ffff8881ccc73300
[45214.508628] RBP: ffff8881f1da5cc0 R08: 0000000000010000 R09: 0000000000000004
[45214.509933] R10: ffff8881fc1bd400 R11: 000000000000000f R12: 0000000000010000
[45214.511205] R13: 0000000000002729 R14: 0000000000000001 R15: ffff8881ccc73300
[45214.512558]  do_splice+0x3da/0x5e0
[45214.513262]  __x64_sys_splice+0xed/0x110
[45214.513981]  do_syscall_64+0x3d/0x100
[45214.514677]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[45214.515416] RIP: 0033:0x7f0c1903585e
[45214.516121] Code: e8 67 fd f7 ff 89 c5 44 8b 4c 24 2c 4c 8b 44 24 20 4c 8b 54 24 18 8b 54 24 28 48 8b 74 24 10 8b 7c 24 08 b8 13 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 32 89 ef 48 89 44 24 08 e8 8e fd f7 ff 48 8b
[45214.518344] RSP: 002b:00007f0c18f28d10 EFLAGS: 00000293 ORIG_RAX: 0000000000000113
[45214.519628] RAX: ffffffffffffffda RBX: 00007ffefc87a4d0 RCX: 00007f0c1903585e
[45214.520965] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000005
[45214.522244] RBP: 0000000000000000 R08: 0000000000010000 R09: 0000000000000004
[45214.523529] R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffefc87a3ce
[45214.532254] R13: 00007ffefc87a3cf R14: 00007f0c18f29700 R15: 00007f0c18f28e40
[45219.364935] btrfs (53683) used greatest stack depth: 12040 bytes left
Comment 9 vladi 2020-03-01 02:03:51 UTC
with 5.5.7 during a btrbk send/receive:

[14021.085239] ------------[ cut here ]------------                                                                                                                                                                                    [0/628]
[14021.085256] NETDEV WATCHDOG: van0 (igb): transmit queue 0 timed out                                                                                                                                                                        
[14021.085278] WARNING: CPU: 4 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0x1fb/0x200                                                                                                                                                 
[14021.085281] Modules linked in: wireguard(O) ip6_udp_tunnel udp_tunnel tcp_diag udp_diag raw_diag inet_diag netlink_diag nfnetlink_queue xt_nat macvlan veth ip6table_filter ip6_tables nf_conntrack_netlink nfnetlink bridge stp llc 8021q 
iptable_raw nf_log_ipv4 nf_log_common xt_LOG xt_limit ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_NFQUEUE xt_mark ipv6 xt_conntrack iptable_filter xt_MASQUERADE xt_addrtype iptable_nat nf_nat iptable_mangle ip_tables x_tables sch_fq_codel nf_c
onntrack_sip nf_conntrack nf_defrag_ipv4 nf_defrag_ipv6 amdgpu kvm_amd tun ccp snd_hda_codec_hdmi mfd_core gpu_sched kvm ttm snd_hda_intel efi_pstore snd_intel_dspcfg irqbypass drm_kms_helper sha1_generic snd_hda_codec tcp_cubic syscopyar
ea snd_hda_core sysfillrect sysimgblt efivars fb_sys_fops k10temp snd_pcm snd_timer drm snd backlight soundcore cp210x evdev usbserial bfq acpi_cpufreq button algif_rng algif_aead algif_hash algif_skcipher af_alg overlay virtiofs fuse ext
4 mbcache jbd2 dm_thin_pool dm_persistent_data                                                                                                                                                                                                
[14021.085336]  dm_snapshot dm_bufio dm_bio_prison usb_storage sr_mod cdrom virtio_crypto crypto_engine virtio_mmio virtio_pci virtio_input virtio_balloon virtio_rng virtio_console virtio_blk virtio_ring virtio          
[14021.085350] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G           O      5.5.7 #65                                                                                                                                                            
[14021.085355] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 5220 09/12/2019                                                                                                      
[14021.085358] RIP: 0010:dev_watchdog+0x1fb/0x200                                                                                                                                                                                             
[14021.085362] Code: 49 63 56 e0 eb 93 4c 89 ef c6 05 1f f6 7a 00 01 e8 8a cc fc ff 44 89 e1 48 89 c2 4c 89 ee 48 c7 c7 40 6b 18 8c e8 15 41 9f ff <0f> 0b eb bf 90 48 83 ec 40 48 89 5c 24 10 48 89 6c 24 18 48 89 cb
[14021.085363] RSP: 0018:ffff9f5700204e98 EFLAGS: 00010282                                                                                                                                                                                    
[14021.085365] RAX: 0000000000000000 RBX: ffff9bcdca4c88c0 RCX: 0000000000000933                                                                                                                                
[14021.085366] RDX: 0000000000000001 RSI: 0000000000000092 RDI: ffffffff8c5c456c                                                                                                                                                              
[14021.085367] RBP: ffff9bcdcb66041c R08: 0000000000000933 R09: 0000000000000001                                                                                                                                
[14021.085368] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000                                                                                                                                                              
[14021.085369] R13: ffff9bcdcb660000 R14: ffff9bcdcb660440 R15: 0000000000000008                                                                                                                                
[14021.085370] FS:  0000000000000000(0000) GS:ffff9bcdd0f00000(0000) knlGS:0000000000000000                                                                                                                                                   
[14021.085372] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033                                                                                                                                                
[14021.085373] CR2: 00007f68c8000010 CR3: 00000003bf346000 CR4: 00000000003406e0                                                                                                                                                              
[14021.085374] Call Trace:                                                                                                                                                                                      
[14021.085376]  <IRQ>                                                                                                                                                                                                                         
[14021.085379]  ? pfifo_fast_enqueue+0x180/0x180                                                                                                                                                                
[14021.085383]  call_timer_fn.isra.0+0x11/0x70                                                                                                                                                                                                
[14021.085386]  run_timer_softirq+0x342/0x390                                                                                                                                                                   
[14021.085389]  ? tick_sched_timer+0x40/0x90                                                                                                                                                                                                  
[14021.085391]  ? tick_sched_do_timer+0x70/0x70                                                                                                                                                                                               
[14021.085394]  ? sched_clock+0x5/0x10                                                                                                                                                                                                        
[14021.085397]  __do_softirq+0xd7/0x22c                                                                                                                                                                         
[14021.085401]  irq_exit+0xe5/0xf0                                                                                                                                                                                                            
[14021.085403]  smp_apic_timer_interrupt+0x66/0xa0                                                                                                                                                                                            
[14021.085406]  apic_timer_interrupt+0xf/0x20                                                                                                                                                                                                 
[14021.085407]  </IRQ>                                                                                                                                                                                                                        
[14021.085410] RIP: 0010:cpuidle_enter_state+0x138/0x260                                                                                                                                                                                      
[14021.085412] Code: e8 ed 0f aa ff 31 ff 48 89 c5 e8 73 28 aa ff 45 84 f6 74 12 9c 58 f6 c4 02 0f 85 0b 01 00 00 31 ff e8 3c 9a ae ff fb 45 85 ed <0f> 88 cf 00 00 00 49 63 d5 bf 68 00 00 00 48 89 e8 48 89 d1 48 2b        
[14021.085413] RSP: 0018:ffff9f5700137e78 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13                                                                                                                                                         
[14021.085415] RAX: ffff9bcdd0f217c0 RBX: ffffffff8c241da0 RCX: 000000000000001f                                                                                                                                                              
[14021.085416] RDX: 0000000000000000 RSI: 0000000023975c86 RDI: 0000000000000000                                                                                                                                                              
[14021.085417] RBP: 00000cc089cde165 R08: 00000cc089cde165 R09: 0000000000000068                                                                                                                                                              
[14021.085418] R10: ffff9bcdd0f20984 R11: ffff9bcdd0f20964 R12: ffff9bcdca4a7c00                                                                                                                                                              
[14021.085419] R13: 0000000000000002 R14: 0000000000000000 R15: ffffffff8c241e88                                                                                                                                                              
[14021.085422]  ? cpuidle_enter_state+0x11d/0x260                                                                                                                                                                                             
[14021.085424]  cpuidle_enter+0x32/0x50                                                                                                                                                                         
[14021.085426]  do_idle+0x1c3/0x230                                                                                                                                                                             
[14021.085429]  cpu_startup_entry+0x14/0x20                                                                            
[14021.085432]  start_secondary+0x143/0x170                                                                                                                                                                                                   
[14021.085435]  secondary_startup_64+0xa4/0xb0                                                                                                                                                                                                
[14021.085437] ---[ end trace 1d715de8f034f47c ]---                                                                                                                                                                                           
[14021.085677] igb 0000:06:00.0 van0: Reset adapter                                                                                                                                                                                           
[14023.003520] igb 0000:06:00.0 van0: igb: van0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX                                                                                                                                     
[14028.203200] igb 0000:06:00.0 van0: igb: van0 NIC Link is Down                                                                                                                                                                              
[14030.881395] igb 0000:06:00.0 van0: igb: van0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Comment 10 David Sterba 2020-03-02 13:39:06 UTC
Well it could happen that there are some things to fix but if the original reporter tests a specific patch and says it worked there's nothing else then to close the bug.

The log excerpt is a generic warning from RCU, this is the same as the original report but otherwise does not point to btrfs and the OOM condition seems to be somehow implied by the overall system state (in both cases).

Looking again at the logs in comment 1:

Dec 31 02:12:53 irrational kernel: ata3.00: cmd 61/08:00:08:78:4c/00:00:4f:00:00/40 tag 0 ncq dma 4096 out
                                            res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)


Dec 31 04:07:58 irrational kernel: iwlwifi 0000:04:00.0: Error sending STATISTICS_CMD: time out after 2000ms.

Dec 31 10:18:29 irrational kernel: NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out

So there are several subsystems that report timeouts, before/after/during the system goes OOM. This rather points to a flaky hardware or buggy drivers (networking).

There's no btrfs function in any of the stacktraces so the memory exhaustion is either from another filesystem or indirectly caused by btrfs, eg. when send starts allocates a lot of memory, finishes and there's too little left for the rest of the system.

So if there's more to fix it needs to be analyzed again perhaps with more logs because what's provided until now is not sufficient. If it would be possible, please get a few samples of /proc/slabinfo (even one sample would be good in case the system in OOM becomes unusable).
Comment 11 Craig Andrews 2020-03-02 14:10:56 UTC
Original reporter here.

The patch seemed to have improved things, but there's still a problem.

Before the patch, my system would reliably crash during the backup process.

With the patch (currently using 5.6.0-rc3), the backups sometimes work and sometimes don't. I haven't yet been able to determine why it sometimes works and sometimes crashes. To be clear, I can boot 5.6.0-rc3, run a backup, and sometimes it will success (maybe 50% of the time?), and sometimes it will fail. When it fails, it does so with an RCU stack trace logged, very similar to https://bugzilla.kernel.org/show_bug.cgi?id=206031#c9 Here's the exact output from my system:
---
Feb 29 03:12:15 irrational kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Feb 29 03:12:15 irrational kernel: rcu:         1-....: (17999 ticks this GP) idle=ba6/1/0x4000000000000002 softirq=4912717/4912717 fqs=5998 
Feb 29 03:12:15 irrational kernel:         (t=18000 jiffies g=7054825 q=1000286)
Feb 29 03:12:15 irrational kernel: NMI backtrace for cpu 1
Feb 29 03:12:15 irrational kernel: CPU: 1 PID: 170362 Comm: btrfs Tainted: G          I       5.6.0-rc3 #2
Feb 29 03:12:15 irrational kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z170 Gaming-ITX/ac, BIOS P7.30 01/23/2018
Feb 29 03:12:15 irrational kernel: Call Trace:
Feb 29 03:12:15 irrational kernel:  <IRQ>
Feb 29 03:12:15 irrational kernel:  dump_stack+0x50/0x68
Feb 29 03:12:15 irrational kernel:  nmi_cpu_backtrace.cold+0x14/0x53
Feb 29 03:12:15 irrational kernel:  ? lapic_can_unplug_cpu.cold+0x3e/0x3e
Feb 29 03:12:15 irrational kernel:  nmi_trigger_cpumask_backtrace+0x8a/0x97
Feb 29 03:12:15 irrational kernel:  rcu_dump_cpu_stacks+0x7b/0xa9
Feb 29 03:12:15 irrational kernel:  rcu_sched_clock_irq.cold+0x153/0x38a
Feb 29 03:12:15 irrational kernel:  ? __cgroup_account_cputime_field+0x3a/0x50
Feb 29 03:12:15 irrational kernel:  update_process_times+0x1f/0x50
Feb 29 03:12:15 irrational kernel:  tick_sched_timer+0x33/0x70
Feb 29 03:12:15 irrational kernel:  ? tick_sched_do_timer+0x60/0x60
Feb 29 03:12:15 irrational kernel:  __hrtimer_run_queues+0xec/0x190
Feb 29 03:12:15 irrational kernel:  hrtimer_interrupt+0x10c/0x240
Feb 29 03:12:15 irrational kernel:  smp_apic_timer_interrupt+0x61/0xa0
Feb 29 03:12:15 irrational kernel:  apic_timer_interrupt+0xf/0x20
Feb 29 03:12:15 irrational kernel:  </IRQ>
Feb 29 03:12:15 irrational kernel: RIP: 0010:mutex_unlock+0x18/0x30
Feb 29 03:12:15 irrational kernel: Code: 8b 70 10 e8 2a e6 6d ff eb aa e8 d3 29 6b ff 0f 1f 00 55 48 89 fd 31 c9 65 48 8b 14 25 c0 7c 01 00 48 89 d0 f0 48 0f b1 4d 00 <48> 39 c2 74 09 48 89 ef 5d e9 da fe ff ff 5d c3 0f 1f 84 00 00 00
Feb 29 03:12:15 irrational kernel: RSP: 0018:ffffb4ce4581be78 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Feb 29 03:12:15 irrational kernel: RAX: ffffa3949d078f00 RBX: 0000000000000004 RCX: 0000000000000000
Feb 29 03:12:15 irrational kernel: RDX: ffffa3949d078f00 RSI: 0000000000008f56 RDI: ffffa3949d1cc300
Feb 29 03:12:15 irrational kernel: RBP: ffffa3949d1cc300 R08: 0000000000000000 R09: ffffa3949d078f00
Feb 29 03:12:15 irrational kernel: R10: 000000000000000f R11: ffffa3938b40f501 R12: 0000000000010000
Feb 29 03:12:15 irrational kernel: R13: 0000000000008f56 R14: ffffa3949d1cc300 R15: ffffa3938b40f500
Feb 29 03:12:15 irrational kernel:  do_splice+0x442/0x660
Feb 29 03:12:15 irrational kernel:  __x64_sys_splice+0xf0/0x110
Feb 29 03:12:15 irrational kernel:  do_syscall_64+0x46/0x120
Feb 29 03:12:15 irrational kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Feb 29 03:12:15 irrational kernel: RIP: 0033:0x7fd1f2249d1e
Feb 29 03:12:15 irrational kernel: Code: e8 a7 f7 f7 ff 89 c5 44 8b 4c 24 2c 4c 8b 44 24 20 4c 8b 54 24 18 8b 54 24 28 48 8b 74 24 10 8b 7c 24 08 b8 13 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 32 89 ef 48 89 44 24 08 e8 ce f7 f7 ff 48 8b
Feb 29 03:12:15 irrational kernel: RSP: 002b:00007fd1f1fdfd10 EFLAGS: 00000293 ORIG_RAX: 0000000000000113
Feb 29 03:12:15 irrational kernel: RAX: ffffffffffffffda RBX: 00007fff3347dae0 RCX: 00007fd1f2249d1e
Feb 29 03:12:15 irrational kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000005
Feb 29 03:12:15 irrational kernel: RBP: 0000000000000000 R08: 0000000000010000 R09: 0000000000000004
Feb 29 03:12:15 irrational kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00007fff3347d9de
Feb 29 03:12:15 irrational kernel: R13: 00007fff3347d9df R14: 00007fd1f1fe0700 R15: 00007fd1f1fdfe40
---
Less than 2 minutes later, the system completely stops responding.

If I use 5.4.0-rc6, backups are 100% reliable. Therefore, I don't think this issue is a hardware problem.
Comment 12 Mark Nowiasz 2020-03-02 14:34:54 UTC
As to myself:

I'm using a couple of virtual machines (kvm/qemu) ranging from 2 to 10 processors, from 8 GB to 64 GB Ram. All show exactly the same symptoms: As soon as my backup job (remote, btrbk, via ssh, which uses btrfs send/receive) starts all of them (when running kernel 5.5.x) will drop down their network. After the patch (5.5.3) they *may* regain their network (I've seen it on one of the machines, after about 3 or 5 hours), but that's not guaranteed. However, they are not completely unresponsive - I'm able to login via KVM console.

And 5.4.x works just fine, no problem whatsoever. Since they are virtualized guest I think we can drop the "faulty hardware" assumption.
Comment 13 Craig Andrews 2020-03-19 15:53:42 UTC
(In reply to Mark Nowiasz from comment #12)
> As to myself:
[snip]
> And 5.4.x works just fine, no problem whatsoever. Since they are virtualized
> guest I think we can drop the "faulty hardware" assumption.

Since you're using virtual machines (versus my not very powerful real hardware), could you bisect linux to find the commit that introduced the problem?

Is anyone else working on solving this issue? I'm happy to test proposed solutions!
Comment 14 Mark Nowiasz 2020-03-19 16:11:37 UTC
(In reply to Craig Andrews from comment #13)

> Since you're using virtual machines (versus my not very powerful real
> hardware), could you bisect linux to find the commit that introduced the
> problem?

Well, I haven't got the experience in bisecting/debugging the kernel myself :-)

However, I'd be happy to assist. Tell you what: I've replaced a couple of the virtual machines with new ones, the four machines are mainly idling/running BOINC for rosetta@home till the end of the contract. I'd be willing to give you access to one of them so you could do the work. All machines are running gentoo with gentoo-kernels. The only other information I can provide: on the 5.4.x branch everything works fine (currently 5.4.25 or 26) but each kernel on 5.5.x - starting with 5.5.0 - doesn't do at all.
 
> Is anyone else working on solving this issue? I'm happy to test proposed
> solutions!

Same here.
Comment 15 vladi 2020-04-05 14:53:46 UTC
Same problem exists with kernel 5.6.x.
Comment 16 vladi 2020-05-07 15:16:03 UTC
5.6.11 same issue:

adding stream_buffer 50m and rate_limit 50m to btrbk.conf does not seem to help..


[  894.137135] rcu: INFO: rcu_sched self-detected stall on CPU
[  894.137138] rcu:   7-....: (126005 ticks this GP) idle=962/1/0x4000000000000002 softirq=62672/62676 fqs=41785 last_accelerate: cc86/b8bc dyntick_enabled: 1
[  894.137139]  (t=126006 jiffies g=92125 q=142836)
[  894.137140] NMI backtrace for cpu 7
[  894.137141] CPU: 7 PID: 24457 Comm: btrfs Not tainted 5.6.11 #69
[  894.137142] Hardware name: System manufacturer System Product Name/PRIME X370-PRO, BIOS 5220 09/12/2019
[  894.137143] Call Trace:
[  894.137144]  <IRQ>
[  894.137148]  dump_stack+0x50/0x70
[  894.137150]  nmi_cpu_backtrace.cold+0x14/0x53
[  894.137152]  ? lapic_can_unplug_cpu.cold+0x39/0x39
[  894.137153]  nmi_trigger_cpumask_backtrace+0x7b/0x88
[  894.137155]  rcu_dump_cpu_stacks+0x7e/0xac
[  894.137156]  rcu_sched_clock_irq.cold+0x175/0x3cd
[  894.137157]  update_process_times+0x1f/0x50
[  894.137159]  tick_sched_timer+0x40/0x90
[  894.137160]  ? tick_sched_do_timer+0x70/0x70
[  894.137161]  __hrtimer_run_queues+0xe0/0x180
[  894.137162]  hrtimer_interrupt+0x108/0x230
[  894.137164]  smp_apic_timer_interrupt+0x61/0xa0
[  894.137165]  apic_timer_interrupt+0xf/0x20
[  894.137166]  </IRQ>
[  894.137167] RIP: 0010:mutex_lock+0x10/0x20
[  894.137168] Code: 66 2e 0f 1f 84 00 00 00 00 00 be 02 00 00 00 e9 06 fb ff ff 66 0f 1f 44 00 00 31 c0 65 48 8b 14 25 c0 7c 01 00 f0 48 0f b1 17 <74> 02 eb dc f3 c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec 10 4c 89
[  894.137169] RSP: 0018:ffff9e3b4024fe70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  894.137170] RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
[  894.137170] RDX: ffff942083324240 RSI: ffff941ee1de9e40 RDI: ffff941ee1de9e40
[  894.137171] RBP: ffff941eff062a80 R08: ffff941eff062a80 R09: ffff942083324240
[  894.137171] R10: 000000000000000f R11: ffff941fdb48fb00 R12: 0000000000010000
[  894.137172] R13: 000000000001355d R14: ffff941eff062a80 R15: ffff941ee1de9e40
[  894.137174]  pipe_double_lock+0x65/0x70
[  894.137176]  do_splice+0xf1/0x630
[  894.137177]  __x64_sys_splice+0x12a/0x150
[  894.137179]  do_syscall_64+0x50/0x1c0
[  894.137180]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  894.137181] RIP: 0033:0x7fe13e6fc1ab
[  894.137182] Code: e8 7a 52 f8 ff 44 8b 4c 24 2c 89 c5 4c 8b 44 24 20 4c 8b 54 24 18 b8 13 01 00 00 8b 54 24 28 48 8b 74 24 10 8b 7c 24 08 0f 05 <48> 3d 00 f0 ff ff 77 31 89 ef 48 89 44 24 08 e8 a1 52 f8 ff 48 8b
[  894.137183] RSP: 002b:00007fe13e5f9d20 EFLAGS: 00000293 ORIG_RAX: 0000000000000113
[  894.137184] RAX: ffffffffffffffda RBX: 00007ffdcebe3730 RCX: 00007fe13e6fc1ab
[  894.137184] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000005
[  894.137184] RBP: 0000000000000000 R08: 0000000000010000 R09: 0000000000000004
[  894.137185] R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffdcebe361e
[  894.137185] R13: 00007ffdcebe361f R14: 0000000000802000 R15: 00007fe13e5fa700
[  906.598113] INGRESS IN=docker0 OUT= MAC=ff:ff:ff:ff:ff:ff:02:42:ac:13:00:0a:08:00 SRC=172.19.0.10 DST=172.19.255.255 LEN=49 TOS=0x00 PREC=0x00 TTL=64 ID=30804 DF PROTO=UDP SPT=37379 DPT=32412 LEN=29
[  913.980578] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 7-... } 131651 jiffies s: 2073 root: 0x80/.
[  913.980584] rcu: blocking rcu_node structures:
[  913.980586] Task dump for CPU 7:
[  913.980588] btrfs           R  running task        0 24457  24452 0x80000008
[  913.980591] Call Trace:
[  913.980599]  ? __schedule+0x1d9/0x540
[  913.980603]  ? apic_timer_interrupt+0xa/0x20
[  913.980605]  ? apic_timer_interrupt+0xa/0x20
[  913.980608]  ? mutex_unlock+0x13/0x20
[  913.980612]  ? pipe_double_lock+0x65/0x70
[  913.980615]  ? do_splice+0xf1/0x630
[  913.980617]  ? __x64_sys_splice+0x12a/0x150
[  913.980620]  ? do_syscall_64+0x50/0x1c0
[  913.980623]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
Comment 17 Mark Nowiasz 2020-05-08 06:48:39 UTC
Since the btrfs folks is apparently completely indifferent to this severe bug maybe we (three?) can work together to find out what exactly the problem is?

Maybe compare kernel .config, see the common denominator etc?

What I also tried: tinkering with the scheduler, setting it to none (which is recommended vor virtio guest, anyway). But exactly the same result - btrbk caused the usual panic.
Comment 18 vladi 2020-05-09 17:21:03 UTC
Changing the RCU timeout settings:
❯ zcat /proc/config.gz | grep RCU
# RCU Subsystem
CONFIG_TREE_RCU=y
CONFIG_RCU_EXPERT=y
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
CONFIG_RCU_FAST_NO_HZ=y
CONFIG_RCU_NOCB_CPU=y
# end of RCU Subsystem
CONFIG_MMU_GATHER_RCU_TABLE_FREE=y
# RCU Debugging
CONFIG_RCU_CPU_STALL_TIMEOUT=60
# end of RCU Debugging


does help with lockups, but system still stalls during btrbk and causers other issues with processes dying and being killed off. This is a pretty big problem not sure why we are the only ones seeing this.
Comment 19 Craig Andrews 2020-07-06 19:31:51 UTC
I gave 5.8.0-rc3 a try and so far so good. My uptime is now at 3 and a half days including 3 successful overnight backups. Perhaps this bug was fixed sometime between 5.5.0-rc1 and 5.8.0-rc3?

I'll report back again after more time passes. Perhaps others can try 5.8 kernels as well and report their results?
Comment 20 Mark Nowiasz 2020-08-05 03:57:25 UTC
(In reply to Craig Andrews from comment #19)
> I gave 5.8.0-rc3 a try and so far so good. My uptime is now at 3 and a half
> days including 3 successful overnight backups. Perhaps this bug was fixed
> sometime between 5.5.0-rc1 and 5.8.0-rc3?
> 
> I'll report back again after more time passes. Perhaps others can try 5.8
> kernels as well and report their results?

I can confirm this. I've updated three machines to 5.8.0, and all three are running fine and survived btrbk's backup run (which previously killed them)

So I guess the issue has been resolved somehow in 5.18...
Comment 21 Mark Nowiasz 2020-08-05 03:58:05 UTC
5.8 that is, not 5.18 :-)
Comment 22 vladi 2020-08-20 19:35:30 UTC
Also confirming issue is solved in 5.8.x

Note You need to log in before you can comment on or make changes to this bug.