Bug 210185
Summary: | kernel BUG at fs/ext4/page-io.c:126! | ||
---|---|---|---|
Product: | File System | Reporter: | emchroma |
Component: | ext4 | Assignee: | fs_ext4 (fs_ext4) |
Status: | NEW --- | ||
Severity: | normal | CC: | emchroma, enbyamy, gernot.poerner, tytso |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 5.9.8 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | AutoCkt Environment, Training parameter, Cleanup Script, AutoCkt patch |
Description
emchroma
2020-11-13 09:16:17 UTC
Hey there! I traced that file back. That line runs page_buffers(), which triggers a BUG if PagePrivate is false.

    #define page_buffers(page) \
        ({ \
            BUG_ON(!PagePrivate(page)); \
            ((struct buffer_head *)page_private(page)); \
        })

What application were you running, if I may ask?

(In reply to Amy from comment #1)
> What application were you running, if I may ask?

Hi Amy,

it happens with AutoCkt (https://github.com/ksettaluri6/AutoCkt), so some Python code. Everything runs fine with kernel 5.7.2; last week we went to kernel 5.9.1 and this bug appeared. Today I tried kernel 5.9.8, but no luck.

The machine is unstable after this bug happens. The load goes up, but we can't kill the Python processes; they're all in state D. Also, a clean reboot is not possible; only a hard reset helps.

How easily can you reproduce the problem? Can you give us instructions for a reliable repro (e.g., download AutoCkt, run it with these options and this input file, and it will crash in N minutes)?

Thanks!

(In reply to Theodore Tso from comment #3)

Hi Theodore,

> How easily can you reproduce the problem?

On each invocation.

> Can you give us instructions for a reliable repro (e.g., download AutoCkt,
> run it with these options and this input file, and it will crash in N
> minutes)?

You'll need Anaconda, Ngspice and AutoCkt. I'm on Debian and installed ngspice with apt-get. I'll try a step-by-step procedure for Anaconda and AutoCkt.

We're using https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh

Download and install Anaconda

    # activate base environment
    $ . .bashrc

    # upgrade pip
    $ pip install --upgrade pip

    # download AutoCkt
    $ git clone https://github.com/ksettaluri6/AutoCkt.git

    # create autockt_updated environment, use the attached autockt_updated.yml,
    # not the environment provided by AutoCkt
    $ conda env create -f autockt_updated.yml

    # activate environment autockt_updated
    $ conda activate autockt_updated

    # copy the attached parameter file to AutoCkt/autockt
    $ cp val_autobag_ray_hyperparameter_tuning.py AutoCkt/autockt

    # run AutoCkt
    $ cd AutoCkt
    $ python autockt/gen_specs.py --num_specs 600
    $ ipython
    $ run autockt/val_autobag_ray_hyperparameter_tuning.py # this is in the IPython shell

AutoCkt creates a lot of files in /tmp/ckt_da and /tmp/ray. /tmp is on the root partition and our root is rather small (100G), so from a second shell we start a Python script that deletes files/directories older than 10 minutes in /tmp/ckt_da.

    # cleanup /tmp/ckt_da
    $ python auto_delete_old_tmp_ckt_da_files.py

AutoCkt is very verbose and you'll probably see some warnings, but it should run. On our machines it usually takes 10-15 minutes, sometimes up to 30 minutes, until the bug appears.

Not sure whether it matters, but we're using software RAID1 for the root partition.

Created attachment 293715 [details]
AutoCkt Environment
Created attachment 293717 [details]
Training parameter
Created attachment 293719 [details]
Cleanup Script
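
The attached script itself is not reproduced in this thread. As a rough illustration only, a cleanup of the kind described in the reproduction steps (delete files/directories in /tmp/ckt_da older than 10 minutes) might be sketched as below. The function name `delete_old_entries` and its parameters are hypothetical and not taken from the attachment; only the path and the 10-minute threshold come from the steps above.

```python
import os
import shutil
import time


def delete_old_entries(root, max_age_seconds, now=None):
    """Delete entries directly under `root` whose mtime is older than the cutoff.

    Hypothetical helper, not the attached script. Returns the removed paths.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    removed = []
    try:
        entries = list(os.scandir(root))
    except FileNotFoundError:
        return removed  # directory does not exist yet, nothing to clean
    for entry in entries:
        try:
            if entry.stat(follow_symlinks=False).st_mtime >= cutoff:
                continue  # recent enough, keep it
            if entry.is_dir(follow_symlinks=False):
                shutil.rmtree(entry.path, ignore_errors=True)
            else:
                os.unlink(entry.path)
            removed.append(entry.path)
        except FileNotFoundError:
            # Entry vanished between listing and deletion; the workload
            # is still creating and removing files concurrently.
            continue
    return removed
```

Calling `delete_old_entries("/tmp/ckt_da", 600)` from a loop with a short sleep would approximate the behaviour described above. Only the top-level modification time is checked, which is an assumption about how "older than 10 minutes" is meant.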
Sorry, I forgot that we modified AutoCkt to our needs. Before you run AutoCkt, please apply the attached patch and modify the .include in AutoCkt/eval_engines/ngspice/ngspice_inputs/netlist/ACDAI.cir and AutoCkt/eval_engines/ngspice/ngspice_inputs/netlist/two_stage_opamp.cir to match your path.

Created attachment 293725 [details]
AutoCkt patch
I've made some more tests with different kernels, and it looks like the bug was introduced with kernel 5.9.0. Kernel 5.8.18 and below (5.8.0 and 5.8.16) run fine; kernel 5.9.0 shows the bug. Today I tried kernel 5.9.9, which also crashes.

I've also run the test on a different machine with different hardware and a different setup; there, all kernels seem to run fine. At least the program ran for a few hours before I stopped it.

Hi, any updates here lately?

Hitting the same bug on a completely different workload and setup. 5.9.x crashes, 5.8.x works; this is all on Debian 10 Buster running as a VM.

Currently we downgrade to 5.8.x because of this. I can provide further information if interested.

On Tue, Feb 2, 2021 at 3:00 AM <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=210185
>
> gpo (gernot.poerner@web.de) changed:
>
> What |Removed |Added
> ----------------------------------------------------------------------------
> CC| |gernot.poerner@web.de
>
> --- Comment #11 from gpo (gernot.poerner@web.de) ---
> Hi, any updates here lately?

Nope, forgot about this thread.

>
> Hitting the same bug on a completely different workload and setup. 5.9.x
> crashes, 5.8.x works, this is all on Debian 10 Buster running as a VM
>

Do 5.10.x or 5.11-rcx work?

> Currently we downgrade to 5.8.x because of this. I can provide further
> information if interested.

Yes, that'd be great.

You said you tried 5.9.1 - does it break on 5.9.0? How many CPUs and how much memory do you have on your servers? And what distribution and distro version are you using, in case that matters?

This is on Debian Buster running 5.x kernels from backports. These are VMs (on KVM) with 4 cores/32G RAM each. They are running as Kubernetes workers. When we set those up with Debian, which in our image comes with kernel 5.9, they constantly crashed after a while with:

[58832.748868] ------------[ cut here ]------------
[58832.749391] kernel BUG at fs/ext4/page-io.c:126!
[58832.749797] invalid opcode: 0000 [#1] SMP PTI
[58832.750166] CPU: 3 PID: 4233 Comm: kworker/u8:4 Not tainted 5.9.0-0.bpo.2-amd64 #1 Debian 5.9.6-1~bpo10+1
[58832.750903] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
[58832.751468] Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work [ext4]
[58832.752001] RIP: 0010:ext4_finish_bio+0x248/0x250 [ext4]
[58832.752434] Code: c7 a0 d0 66 c0 e8 d8 a0 03 f3 85 c0 0f 84 4c ff ff ff e9 f8 37 02 00 49 8b 44 24 28 4c 89 64 24 28 48 89 04 24 e9 51 fe ff ff <0f> 0b 66 0f 1f 44 00 00 66 66 66 66 90 41 57 41 56 41 55 41 54 55
[58832.753829] RSP: 0018:ffffb7b603637dc0 EFLAGS: 00010246
[58832.754264] RAX: 0000000000000081 RBX: ffff8c042706ae40 RCX: 0000000000000001
[58832.754822] RDX: 0000000000000000 RSI: 0000000000001000 RDI: ffff8c042706ae40
[58832.755378] RBP: 0000000000001000 R08: 0000000000000000 R09: ffffffffc0682b50
[58832.755943] R10: ffff8c0262c7f9d8 R11: 0000000000000001 R12: ffffefbd1c982cc0
[58832.756510] R13: ffff8c042706ae40 R14: 0000000000000001 R15: 0000000000000000
[58832.757055] FS: 0000000000000000(0000) GS:ffff8c045fd80000(0000) knlGS:0000000000000000
[58832.757677] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[58832.758138] CR2: 00007fd4543d2d80 CR3: 0000000609c48004 CR4: 00000000000606e0
[58832.758703] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[58832.759252] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[58832.759801] Call Trace:
[58832.760048] ext4_release_io_end+0x48/0xf0 [ext4]
[58832.760422] ext4_end_io_rsv_work+0x92/0x180 [ext4]
[58832.760824] process_one_work+0x1ad/0x370
[58832.761191] worker_thread+0x30/0x390
[58832.761505] ? create_worker+0x1a0/0x1a0
[58832.761843] kthread+0x116/0x130
[58832.762115] ? kthread_park+0x80/0x80
[58832.762426] ret_from_fork+0x22/0x30
[58832.762719] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace nfs_ssc fscache iptable_filter xt_CT xt_multiport xt_nat xt_tcpudp veth dm_mod xt_set ip_set_hash_ipport ip_set_bitmap_port ip_set_hash_ipportnet ip_set_hash_ipportip ip_set dummy ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs xt_comment xt_mark xt_conntrack xt_MASQUERADE nf_conntrack_netlink xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c nf_tables nfnetlink br_netfilter bridge stp llc overlay intel_rapl_msr intel_rapl_common crc32_pclmul ghash_clmulni_intel aesni_intel libaes crypto_simd cryptd glue_helper hid_generic usbhid hid virtio_balloon joydev evdev pcspkr serio_raw qemu_fw_cfg button sunrpc tcp_bbr ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic ata_generic virtio_net net_failover virtio_blk failover uhci_hcd ata_piix ehci_hcd libata usbcore scsi_mod crct10dif_pclmul crct10dif_common psmouse usb_common
[58832.762748] crc32c_intel virtio_pci virtio_ring virtio i2c_piix4
[58832.769326] ---[ end trace 22c803ae565a2f8c ]---
[58832.769746] RIP: 0010:ext4_finish_bio+0x248/0x250 [ext4]
[58832.770183] Code: c7 a0 d0 66 c0 e8 d8 a0 03 f3 85 c0 0f 84 4c ff ff ff e9 f8 37 02 00 49 8b 44 24 28 4c 89 64 24 28 48 89 04 24 e9 51 fe ff ff <0f> 0b 66 0f 1f 44 00 00 66 66 66 66 90 41 57 41 56 41 55 41 54 55
[58832.771522] RSP: 0018:ffffb7b603637dc0 EFLAGS: 00010246
[58832.771927] RAX: 0000000000000081 RBX: ffff8c042706ae40 RCX: 0000000000000001
[58832.772454] RDX: 0000000000000000 RSI: 0000000000001000 RDI: ffff8c042706ae40
[58832.772984] RBP: 0000000000001000 R08: 0000000000000000 R09: ffffffffc0682b50
[58832.773521] R10: ffff8c0262c7f9d8 R11: 0000000000000001 R12: ffffefbd1c982cc0
[58832.774045] R13: ffff8c042706ae40 R14: 0000000000000001 R15: 0000000000000000
[58832.774566] FS: 0000000000000000(0000) GS:ffff8c045fd80000(0000) knlGS:0000000000000000
[58832.775157] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[58832.775604] CR2: 00007fd4543d2d80 CR3: 0000000609c48004 CR4: 00000000000606e0
[58832.776122] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[58832.776636] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[58832.777195] Kernel panic - not syncing: Fatal exception
[58832.777840] Kernel Offset: 0x32200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[58832.778610] Rebooting in 1 seconds..
Loading Linux 5.9.0-0.bpo.2-amd64 ...
Loading initial ramdisk ...
[ 0.000000] Linux version 5.9.0-0.bpo.2-amd64 (debian-kernel@lists.debian.org) (gcc-8 (Debian 8.3.0-6) 8.3.0, GNU ld (GNU Binutils for Debian) 2.31.1) #1 SMP Debian 5.9.6-1~bpo10+1 (2020-11-19)
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.9.0-0.bpo.2-amd64 root=UUID=b722221a-e6a3-4609-a545-19454bd4f53c ro console=tty0 console=ttyS0,115200n8 noplymouth elevator=noop nomodeset net.ifnames=0 biosdevname=0
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
[ 0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bffdbfff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000bffdc000-0x00000000bfffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000083fffffff] usable

This is also how I came upon this bug report. Downgrading to the latest 5.8 (5.8.0-0.bpo.2-amd64 #1 SMP Debian 5.8.10-1~bpo10+1 (2020-09-26) x86_64 GNU/Linux) stopped the constant crashing.

The problem here is that we could not identify for sure what really causes the crash; these nodes run differing workloads since they are Kubernetes workers. I will go ahead and test a later kernel (5.10) on one of them to see whether the problem is already fixed upstream.

OK, quick update here: we built a vanilla 5.10.16 and ran it for a couple of days, and the crash didn't occur again, so it seems to be fixed, at least for our problem. We will also test with the official Debian 5.10, which has now been released, but I'm very optimistic it's fixed.