Bug 199457

Summary: [NFS] general protection fault: 0000 [#1] SMP PTI
Product: File System Reporter: JianhongYin (yin-jianhong)
Component: NFSAssignee: Trond Myklebust (trondmy)
Status: NEW ---    
Severity: normal    
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 4.17.0-rc1 Subsystem:
Regression: No Bisected commit-id:
Attachments: reproducer

Description JianhongYin 2018-04-21 02:59:07 UTC
got panic when run lustre-racer over NFS test on latest upstream:

'''
[   63.927876] NFSD: starting 90-second grace period (net f00000a8) 
[   63.981533] FS-Cache: Loaded 
[   64.113298] FS-Cache: Netfs 'nfs' registered for caching 
[   64.129363] Key type dns_resolver registered 
[   64.291524] NFS: Registering the id_resolver key type 
[   64.297179] Key type id_resolver registered 
[   64.301859] Key type id_legacy registered 
[  106.154061] 8[62671]: segfault at 8 ip 00007fab2a5de918 sp 00007ffc6598c230 error 4 in ld-2.17.so[7fab2a5d3000+22000] 
[  254.397650] 3[15193]: segfault at 0 ip 0000000000403e4f sp 00007ffed8116660 error 6 in 3 (deleted)[400000+6000] 
[-- MARK -- Fri Apr 20 20:00:00 2018] 
[  414.125571] 8[78617]: segfault at 0 ip 0000000000403e4f sp 00007ffdee59a8e0 error 6 in 8 (deleted)[400000+6000] 
[  500.677718] perf: interrupt took too long (2519 > 2500), lowering kernel.perf_event_max_sample_rate to 79000 
[  506.303294] 7[53144]: segfault at 8 ip 00007f6e488a4918 sp 00007ffd0c3040d0 error 4 in ld-2.17.so[7f6e48899000+22000] 
[  616.201854] 8[46687]: segfault at 4017a0 ip 00000000004017a0 sp 00007ffd095c2710 error 14 
[  637.442799] perf: interrupt took too long (3149 > 3148), lowering kernel.perf_event_max_sample_rate to 63000 
[-- MARK -- Fri Apr 20 20:05:00 2018] 
[  731.965938] 13[17830]: segfault at 0 ip 0000000000403e4f sp 00007ffdbaf15bd0 error 6 in 13 (deleted)[400000+6000] 
[  803.352338] 0[46544]: segfault at 0 ip 0000000000403e4f sp 00007fffdd30aa50 error 6 in 0 (deleted)[400000+6000] 
[  812.734251] 6[40484]: segfault at 0 ip 0000000000403e4f sp 00007ffedab551b0 error 6 in 6 (deleted)[400000+6000] 
[  841.253201] perf: interrupt took too long (3969 > 3936), lowering kernel.perf_event_max_sample_rate to 50000 
[-- MARK -- Fri Apr 20 20:10:00 2018] 
[ 1024.139506] 4[48815]: segfault at 0 ip 0000000000403e4f sp 00007fffdfc9aee0 error 6 in 4 (deleted)[400000+6000] 
[ 1086.414180] 18[38715]: segfault at 0 ip 0000000000403e4f sp 00007ffddf031c80 error 6 in 18 (deleted)[400000+6000] 
[ 1146.510480] perf: interrupt took too long (4972 > 4961), lowering kernel.perf_event_max_sample_rate to 40000 
[ 1158.138011] NFS: server localhost error: fileid changed 
[ 1158.138011] fsid 0:45: expected fileid 0x0, got 0x84f 
[ 1158.149494] NFS: server localhost error: fileid changed 
[ 1158.149494] fsid 0:45: expected fileid 0x0, got 0x84f 
[ 1158.160963] NFS: server localhost error: fileid changed 
[ 1158.160963] fsid 0:45: expected fileid 0x0, got 0x84f 
[-- MARK -- Fri Apr 20 20:15:00 2018] 
[ 1392.954349] general protection fault: 0000 [#1] SMP PTI 
[ 1392.960185] Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ext4 mbcache jbd2 loop nfsd auth_rpcgss nfs_acl lockd grace intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp sunrpc kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd cryptd glue_helper ipmi_ssif iTCO_wdt mei_me iTCO_vendor_support mxm_wmi pcspkr joydev mei sg i2c_i801 shpchp lpc_ich ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter acpi_pad ip_tables xfs libcrc32c sr_mod cdrom sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ixgbe ahci libahci libata crc32c_intel i2c_core mdio dca dm_mirror dm_region_hash dm_log dm_mod 
[ 1393.031909] CPU: 52 PID: 5882 Comm: nfsd Not tainted 4.17.0-rc1 #1 
[ 1393.038808] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRRFSDP1.86B.0271.R00.1510301446 10/30/2015 
[ 1393.050167] RIP: 0010:prefetch_freepointer.isra.65+0xc/0x20 
[ 1393.056389] RSP: 0018:ffffbc06089b7c30 EFLAGS: 00010286 
[ 1393.062222] RAX: 0000000000000000 RBX: ff92b498cc49f800 RCX: 0000000000000183 
[ 1393.070188] RDX: 0000000000000182 RSI: ff92b498cc49f800 RDI: ffff92b49f87de20 
[ 1393.078154] RBP: 00000000014080c0 R08: 000000000002dfc0 R09: ffff92b479731070 
[ 1393.086112] R10: 0000000001ce3140 R11: 0000000000000000 R12: ffff92b498cc490f 
[ 1393.094081] R13: ffff92b49f87de00 R14: ffffffffc0713fe3 R15: ffff92b49f87de00 
[ 1393.102048] FS:  0000000000000000(0000) GS:ffff92b49f980000(0000) knlGS:0000000000000000 
[ 1393.111081] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
[ 1393.117497] CR2: 000000000241768c CR3: 0000000830e0a002 CR4: 00000000003606e0 
[ 1393.125465] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
[ 1393.133432] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 
[ 1393.141399] Call Trace: 
[ 1393.144137]  kmem_cache_alloc+0xa3/0x1c0 
[ 1393.148541]  ? nfsd4_free_file_rcu+0x20/0x20 [nfsd] 
[ 1393.153990]  nfs4_alloc_stid+0x23/0xb0 [nfsd] 
[ 1393.158873]  nfsd4_process_open2+0xc6b/0x12a0 [nfsd] 
[ 1393.164434]  ? fh_verify+0x33e/0x5f0 [nfsd] 
[ 1393.169117]  ? nfsd4_process_open1+0x137/0x400 [nfsd] 
[ 1393.174771]  nfsd4_open+0x4ce/0x790 [nfsd] 
[ 1393.179360]  ? nfsd4_encode_operation+0xdf/0x1c0 [nfsd] 
[ 1393.185205]  nfsd4_proc_compound+0x5aa/0x7b0 [nfsd] 
[ 1393.190668]  nfsd_dispatch+0xb7/0x250 [nfsd] 
[ 1393.195447]  svc_process_common+0x399/0x750 [sunrpc] 
[ 1393.201011]  svc_process+0xeb/0x100 [sunrpc] 
[ 1393.205793]  nfsd+0xe3/0x150 [nfsd] 
[ 1393.209694]  kthread+0xf8/0x130 
[ 1393.213215]  ? nfsd_destroy+0x60/0x60 [nfsd] 
[ 1393.217987]  ? kthread_bind+0x10/0x10 
[ 1393.222083]  ret_from_fork+0x35/0x40 
[ 1393.226074] Code: 48 89 fb e8 57 f1 ff ff 48 89 df 48 89 ee 31 c9 5b 5d 31 d2 e9 46 f1 ff ff 66 0f 1f 44 00 00 0f 1f 44 00 00 48 85 f6 74 09 8b 07 <48> 8b 04 06 0f 18 08 f3 c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f  
[ 1393.247162] RIP: prefetch_freepointer.isra.65+0xc/0x20 RSP: ffffbc06089b7c30 
[ 1393.255103] ---[ end trace 27db7bc3128451f1 ]--- 
[ 1393.260262] Kernel panic - not syncing: Fatal exception 
[ 1393.266153] WARNING: CPU: 82 PID: 504 at kernel/sched/core.c:1163 set_task_cpu+0x17b/0x190 
[ 1393.275379] Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ext4 mbcache jbd2 loop nfsd auth_rpcgss nfs_acl lockd grace intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp sunrpc kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd cryptd glue_helper ipmi_ssif iTCO_wdt mei_me iTCO_vendor_support mxm_wmi pcspkr joydev mei sg i2c_i801 shpchp lpc_ich ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter acpi_pad ip_tables xfs libcrc32c sr_mod cdrom sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ixgbe ahci libahci libata crc32c_intel i2c_core mdio dca dm_mirror dm_region_hash dm_log dm_mod 
[ 1393.347085] CPU: 82 PID: 504 Comm: migration/82 Tainted: G      D           4.17.0-rc1 #1 
[ 1393.356211] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRRFSDP1.86B.0271.R00.1510301446 10/30/2015 
[ 1393.367566] RIP: 0010:set_task_cpu+0x17b/0x190 
[ 1393.372522] RSP: 0018:ffffbc060762be20 EFLAGS: 00010046 
[ 1393.378351] RAX: 0000000000000000 RBX: ffff92b4dc962e00 RCX: 0000000000000000 
[ 1393.386314] RDX: 0000000000000001 RSI: 000000000000000e RDI: ffff92b4dc962e00 
[ 1393.394276] RBP: ffff92bc9f3a23c0 R08: 0000000000000bab R09: 0000000000020ef2 
[ 1393.402238] R10: 000000000000007f R11: ffffbc06358dfc34 R12: 000000000000000e 
[ 1393.410200] R13: 000000000000000e R14: ffffbc06358dfe70 R15: ffff92bc9f39d4b0 
[ 1393.418163] FS:  0000000000000000(0000) GS:ffff92bc9f380000(0000) knlGS:0000000000000000 
[ 1393.427191] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
[ 1393.433602] CR2: 00007f20a0eb645f CR3: 0000000830e0a006 CR4: 00000000003606e0 
[ 1393.441564] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
[ 1393.449526] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 
[ 1393.457488] Call Trace: 
[ 1393.460221]  move_queued_task+0x86/0x160 
[ 1393.464596]  migration_cpu_stop+0xf8/0x110 
[ 1393.469168]  ? sched_ttwu_pending+0xd0/0xd0 
[ 1393.473838]  cpu_stopper_thread+0x86/0x100 
[ 1393.478414]  smpboot_thread_fn+0x10e/0x160 
[ 1393.482986]  kthread+0xf8/0x130 
[ 1393.486482]  ? sort_range+0x20/0x20 
[ 1393.490371]  ? kthread_bind+0x10/0x10 
[ 1393.494462]  ret_from_fork+0x35/0x40 
[ 1393.498450] Code: 0f ff ff ff 80 8b 5c 08 00 00 04 e9 2b ff ff ff 0f 0b e9 ca fe ff ff f7 43 60 fd ff ff ff 0f 84 d4 fe ff ff 0f 0b e9 cd fe ff ff <0f> 0b 0f 1f 00 e9 d9 fe ff ff 90 66 2e 0f 1f 84 00 00 00 00 00  
[ 1393.519524] ---[ end trace 27db7bc3128451f2 ]--- 
[ 1393.524688] Kernel Offset: 0x2cc00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) 
[ 1393.536734] ---[ end Kernel panic - not syncing: Fatal exception ]--- 
[ 1393.543930] ------------[ cut here ]------------ 
[ 1393.549081] sched: Unexpected reschedule of offline CPU#22! 
[ 1393.555305] WARNING: CPU: 52 PID: 5882 at arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x39/0x40 
[ 1393.565887] Modules linked in: nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ext4 mbcache jbd2 loop nfsd auth_rpcgss nfs_acl lockd grace intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp sunrpc kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd cryptd glue_helper ipmi_ssif iTCO_wdt mei_me iTCO_vendor_support mxm_wmi pcspkr joydev mei sg i2c_i801 shpchp lpc_ich ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter acpi_pad ip_tables xfs libcrc32c sr_mod cdrom sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ixgbe ahci libahci libata crc32c_intel i2c_core mdio dca dm_mirror dm_region_hash dm_log dm_mod 
[ 1393.637594] CPU: 52 PID: 5882 Comm: nfsd Tainted: G      D W         4.17.0-rc1 #1 
[ 1393.646041] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRRFSDP1.86B.0271.R00.1510301446 10/30/2015 
[ 1393.657398] RIP: 0010:native_smp_send_reschedule+0x39/0x40 
[ 1393.663517] RSP: 0018:ffff92b49f983d08 EFLAGS: 00010086 
[ 1393.669345] RAX: 0000000000000000 RBX: ffff92ad43139700 RCX: 0000000000000006 
[ 1393.677305] RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff92b49f9968b0 
[ 1393.685267] RBP: 00000000000223c0 R08: 0000000000000000 R09: 00000000000005c6 
[ 1393.693227] R10: 0000000000000000 R11: ffff92b49f983a68 R12: ffff92ad4313a25c 
[ 1393.701189] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000016 
[ 1393.709153] FS:  0000000000000000(0000) GS:ffff92b49f980000(0000) knlGS:0000000000000000 
[ 1393.718180] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
[ 1393.724591] CR2: 000000000241768c CR3: 0000000830e0a002 CR4: 00000000003606e0 
[ 1393.732551] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
[ 1393.740514] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 
[ 1393.748475] Call Trace: 
[ 1393.751204]  <IRQ> 
[ 1393.753449]  try_to_wake_up+0x3d7/0x450 
[ 1393.757729]  __wake_up_common+0x8f/0x160 
[ 1393.762107]  ep_poll_callback+0xc4/0x2f0 
[ 1393.766486]  __wake_up_common+0x8f/0x160 
[ 1393.770851]  __wake_up_common_lock+0x7a/0xc0 
[ 1393.775619]  irq_work_run_list+0x4c/0x70 
[ 1393.779998]  ? tick_sched_do_timer+0x80/0x80 
[ 1393.784756]  update_process_times+0x3b/0x50 
[ 1393.789424]  tick_sched_handle+0x25/0x60 
[ 1393.793800]  tick_sched_timer+0x37/0x70 
[ 1393.798080]  __hrtimer_run_queues+0xfb/0x270 
[ 1393.802845]  hrtimer_interrupt+0xfb/0x230 
[ 1393.807320]  smp_apic_timer_interrupt+0x62/0x130 
[ 1393.812471]  apic_timer_interrupt+0xf/0x20 
[ 1393.817041]  </IRQ> 
[ 1393.819383] RIP: 0010:panic+0x206/0x25c 
[ 1393.823658] RSP: 0018:ffffbc06089b7ae8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13 
[ 1393.832105] RAX: 0000000000000039 RBX: 0000000000000200 RCX: 0000000000000006 
[ 1393.840065] RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffff92b49f9968b0 
[ 1393.848026] RBP: ffffbc06089b7b58 R08: 0000000000000000 R09: 00000000000005c4 
[ 1393.855989] R10: 0000000000000000 R11: ffffbc06089b7850 R12: ffffffffaea6025c 
[ 1393.863950] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 
[ 1393.871913]  ? panic+0x1ff/0x25c 
[ 1393.875515]  oops_end+0xb0/0xc0 
[ 1393.879019]  general_protection+0x1e/0x30 
[ 1393.883495] RIP: 0010:prefetch_freepointer.isra.65+0xc/0x20 
[ 1393.889703] RSP: 0018:ffffbc06089b7c30 EFLAGS: 00010286 
[ 1393.895532] RAX: 0000000000000000 RBX: ff92b498cc49f800 RCX: 0000000000000183 
[ 1393.903494] RDX: 0000000000000182 RSI: ff92b498cc49f800 RDI: ffff92b49f87de20 
[ 1393.911454] RBP: 00000000014080c0 R08: 000000000002dfc0 R09: ffff92b479731070 
[ 1393.919417] R10: 0000000001ce3140 R11: 0000000000000000 R12: ffff92b498cc490f 
[ 1393.927379] R13: ffff92b49f87de00 R14: ffffffffc0713fe3 R15: ffff92b49f87de00 
[ 1393.935354]  ? nfs4_alloc_stid+0x23/0xb0 [nfsd] 
[ 1393.940402]  kmem_cache_alloc+0xa3/0x1c0 
[ 1393.944784]  ? nfsd4_free_file_rcu+0x20/0x20 [nfsd] 
[ 1393.950230]  nfs4_alloc_stid+0x23/0xb0 [nfsd] 
[ 1393.955095]  nfsd4_process_open2+0xc6b/0x12a0 [nfsd] 
[ 1393.960635]  ? fh_verify+0x33e/0x5f0 [nfsd] 
[ 1393.965304]  ? nfsd4_process_open1+0x137/0x400 [nfsd] 
[ 1393.970944]  nfsd4_open+0x4ce/0x790 [nfsd] 
[ 1393.975518]  ? nfsd4_encode_operation+0xdf/0x1c0 [nfsd] 
[ 1393.981351]  nfsd4_proc_compound+0x5aa/0x7b0 [nfsd] 
[ 1393.986795]  nfsd_dispatch+0xb7/0x250 [nfsd] 
[ 1393.991568]  svc_process_common+0x399/0x750 [sunrpc] 
[ 1393.997114]  svc_process+0xeb/0x100 [sunrpc] 
[ 1394.001880]  nfsd+0xe3/0x150 [nfsd] 
[ 1394.005771]  kthread+0xf8/0x130 
[ 1394.009277]  ? nfsd_destroy+0x60/0x60 [nfsd] 
[ 1394.014038]  ? kthread_bind+0x10/0x10 
[ 1394.018122]  ret_from_fork+0x35/0x40 
[ 1394.022108] Code: 01 0f 92 c0 84 c0 74 15 48 8b 05 73 53 ed 00 be fd 00 00 00 48 8b 40 30 e9 25 d9 9a 00 89 fe 48 c7 c7 f8 82 a6 ae e8 c7 d1 03 00 <0f> 0b c3 0f 1f 40 00 0f 1f 44 00 00 53 be 20 00 08 01 48 89 fb  
[ 1394.043183] ---[ end trace 27db7bc3128451f3 ]--- 
[-- MARK -- Fri Apr 20 20:20:00 2018] 
[-- MARK -- Fri Apr 20 20:25:00 2018] 
'''
Comment 1 JianhongYin 2018-04-21 03:27:56 UTC
test script:

'''
svn export https://github.com/hpc/lustre/trunk/lustre/tests/racer lustre-racer.orig

BZ=lustre-racer
loopf=/tmp/loop-$BZ
nfsexpdir=/mnt/lustre
nfsmp=/mnt/nfsmp
nfsmp2=/mnt/nfsmp2

mntSrc1=${REMOTE_SHARE:-localhost:$nfsexpdir}
mntSrc2=${REMOTE_SHARE2:-localhost:/}

        mkdir -p $nfsexpdir $nfsmp $nfsmp2
        echo "$nfsexpdir *(rw,no_root_squash,fsid=0)" >/etc/exports

        dd if=/dev/zero of=$loopf bs=1024k count=1024
        mkfs.ext4 -F $loopf

        service rpcbind start
        mount none /var/lib/nfs -t tmpfs
        touch /var/lib/nfs/etab
        # start nfs-mountd.service
        service_nfs start

        pushd lustre-racer.orig
        while :; do
                date

                mount $loopf $nfsexpdir -o loop || exit 1
                mkdir $nfsexpdir/racer
                # start nfs-server.service
                service_nfs restart || exit 2
                mount $mntSrc1 $nfsmp -t nfs -o nolock || exit 3
                mount $mntSrc2 $nfsmp2 -t nfs4 || exit 4
                DURATION=3600 sh racer.sh $nfsmp/racer &
                DURATION=3600 sh racer.sh $nfsmp2/racer &
                DURATION=3600 sh racer.sh $nfsexpdir/racer &
                wait %1 %2 %3
                umount $nfsmp || exit 5
                umount $nfsmp2 || exit 6
                # stop nfs-server.service
                service_nfs stop || exit 7
                umount $nfsexpdir || exit 8
                e2fsck -n -f $loopf || exit 9
        done
        popd
'''
Comment 2 JianhongYin 2018-04-21 03:57:11 UTC
Created attachment 275471 [details]
reproducer

How reproducable: 20+%