Most recent kernel where this bug did not occur: Unknown. Nonexistent?
Distribution: Gentoo Linux 2005.x
Hardware Environment:
CPU: DEC 21164 EV5 at 266Mhz
System: DEC 21171 Alcor
IEEE1394 card: Creative Labs SB Audigy FireWire Port
Firewire to IDE bridge: Oxford 911
Software Environment:
Linux 2.6.15, mount 2.12r, xfsprogs 2.6.13
Partition requiring log replay on the IDE hard drive

Problem Description:
When I attached and tried to mount the drive, I got:

ieee1394: sbp2: Logged into SBP-2 device
ieee1394: Node 0-00:1023: Max speed [S400] - Max payload [2048]
Vendor: WDC WD80 Model: 0JB-00JJA0 Rev: 05.0
Type: Direct-Access-RBC ANSI SCSI revision: 04
SCSI device sdb: 156301488 512-byte hdwr sectors (80026 MB)
sdb: asking for cache data failed
sdb: assuming drive cache: write through
SCSI device sdb: 156301488 512-byte hdwr sectors (80026 MB)
sdb: asking for cache data failed
sdb: assuming drive cache: write through
sdb: sdb1 sdb2 sdb3 sdb4
sd 4:0:0:0: Attached scsi disk sdb
XFS mounting filesystem sdb1
Starting XFS recovery on filesystem: sdb1 (logdev: internal)
Unable to handle kernel paging request at virtual address b6db69da5323dc00
mount(8951): Oops 0
pc = [<fffffc000048fbc8>] ra = [<fffffc000049088c>] ps = 0000 Not tainted
pc is at xlog_recover_commit_trans+0x4b8/0x19c0
ra is at xlog_recover_commit_trans+0x117c/0x19c0
v0 = b6db69da5323dc00 t0 = fffffc0000000000 t1 = 0000aa33f90ee000
t2 = 3e9c4e70ee000000 t3 = fffffc003e8a7e58 t4 = fffffc0000790000
t5 = fffffc003e8a7e58 t6 = fffffc003e8a7d28 t7 = fffffc003b9d0000
s0 = fffffc003c82e5e0 s1 = fffffc003e8a7cf0 s2 = 0000000000000000
s3 = 0000000000000000 s4 = fffffc003f279800 s5 = 0000000000000000
s6 = b6db69da5323dc00
a0 = fffffc003e8a7cf0 a1 = 0000000000001c00 a2 = 0000000000000000
a3 = 0000000000000000 a4 = 0000000000004001 a5 = fffffc0000700b98
t8 = 0000000000000000 t9 = 000000410d9b4090 t10= 8200000000000000
t11= 0000000000000000 pv = fffffc00004ab4d0 at = 000000000000007f
gp = fffffc000075d400 sp = fffffc003b9d36c8
Trace:
[<fffffc00004915e0>] xlog_recover_process_data+0x510/0x670
[<fffffc00004a8ea8>] kmem_alloc+0x98/0x190
[<fffffc00004915e0>] xlog_recover_process_data+0x510/0x670
[<fffffc00004925a8>] xlog_do_recovery_pass+0x4a8/0x7e0
[<fffffc00004929dc>] xlog_recover+0xfc/0x2e0
[<fffffc000049299c>] xlog_recover+0xbc/0x2e0
[<fffffc000048a97c>] xfs_log_mount+0x43c/0x6d0
[<fffffc0000495444>] xfs_mountfs+0xc14/0x11b0
[<fffffc00004847b0>] xfs_ioinit+0x50/0x70
[<fffffc00004ad588>] pagebuf_iostart+0xf8/0x130
[<fffffc00004929dc>] xlog_recover+0xfc/0x2e0
[<fffffc000049299c>] xlog_recover+0xbc/0x2e0
[<fffffc000048a97c>] xfs_log_mount+0x43c/0x6d0
[<fffffc0000495444>] xfs_mountfs+0xc14/0x11b0
[<fffffc00004847b0>] xfs_ioinit+0x50/0x70
[<fffffc00004ad588>] pagebuf_iostart+0xf8/0x130
[<fffffc000049386c>] xfs_readsb+0x13c/0x6c0
[<fffffc00004adba0>] xfs_setsize_buftarg_flags+0xc0/0x190
[<fffffc00004847b0>] xfs_ioinit+0x50/0x70
[<fffffc000049d884>] xfs_mount+0xa14/0xb70
[<fffffc00004b617c>] vfs_mount+0x3c/0x60
[<fffffc00004b5d30>] linvfs_fill_super+0x0/0x3d0
[<fffffc00004b5e38>] linvfs_fill_super+0x108/0x3d0
[<fffffc0000383cdc>] get_sb_bdev+0x1cc/0x2b0
[<fffffc0000382c00>] sget+0x3a0/0x420
[<fffffc0000383bac>] get_sb_bdev+0x9c/0x2b0
[<fffffc00003844f4>] sb_set_blocksize+0x34/0xa0
[<fffffc0000383cbc>] get_sb_bdev+0x1ac/0x2b0
[<fffffc00004b6120>] linvfs_get_sb+0x20/0x40
[<fffffc0000384158>] do_kern_mount+0x88/0x1a0
[<fffffc00003a17d8>] do_mount+0x5c8/0x970
[<fffffc00003a2004>] sys_mount+0xc4/0x160
[<fffffc0000316de0>] handle_irq+0x140/0x1c0
[<fffffc0000317400>] do_entInt+0x110/0x180
[<fffffc0000355d88>] __alloc_pages+0x68/0x380
[<fffffc00003560e0>] __get_free_pages+0x40/0xe0
[<fffffc00003a10a8>] copy_mount_options+0x48/0x1b0
[<fffffc00003a1fd8>] sys_mount+0x98/0x160
[<fffffc0000311354>] entSys+0xa4/0xc0
Code: 47ff040c 482052d1 44409002 e4400330 a4200098 4031040f <a02f0000> 48207621

And then mount segfaults.
Subsequent attempts to do anything to sdb1 caused the process making the attempt to hang, including the shutdown script.

Then I tried putting the drive on an x86 machine, and I got:

Starting XFS recovery on filesystem: sdb1 (dev: sdb1)
Ending XFS recovery on filesystem: sdb1 (dev: sdb1)
XFS mounting filesystem sdb1

And now someone else is using the drive.

Without mounting, xfs_check and such work, but xfs_repair doesn't want to do anything for fear of destroying important stuff in the log. I forgot to try fsck, but other people say that doesn't work on 64-bit machines (Sparc), either. With or without trying to mount the XFS partition, mounting the VFAT partition happens without incident.

Steps to reproduce:
1. Get a drive with XFS.
2. Get it unmounted uncleanly so that it requires log replay.
3. See it fail to mount on a 64-bit machine unless it is sent to a 32-bit machine first.
Log replay isn't endian- and/or word-size-independent; if the journal is dirty from i386, it won't replay cleanly on Alpha or similar. Is that what is going on here?
Hm, well, it still shouldn't oops; it should recognize an invalid log format if one is found. However, there were some bugs in this area which have been fixed. See http://oss.sgi.com/archives/xfs/2006-05/msg00051.html

I can't remember for sure if this led to an oops (Tim?) (feel free to search the archives).

Can the original reporter try with a recent kernel? 2.6.15 predates that fix.

Also Tim says this:

> Log replay is certainly not endian independent but it can handle
> different word-size machines nowadays (didn't in the past).
> So you can go between i386 and x86_64 for example, with a dirty log.
> The log formats will be different between the two for some items
> but in recovery with a newer kernel we do conversion on the fly.
> It certainly does sound like that is the issue here, Chris.
> The back trace is just like a typical different log format problem.
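For readers unfamiliar with the word-size problem Tim describes, here is a rough sketch of the general idea. It is not XFS code: the two record layouts, the field names, and both reader functions are hypothetical, chosen only to show how a reader that assumes its own word size misparses a log written with another, and how picking the layout from the recorded item length avoids that.

```python
import struct

# Hypothetical log item layouts: (type, size, address).
# Illustrative only; this is NOT the real XFS log item format.
FMT_32 = "<HHI"       # 32-bit writer: 2 + 2 + 4 = 8 bytes
FMT_64 = "<HHxxxxQ"   # 64-bit writer: 2 + 2 + 4 padding + 8 = 16 bytes


def read_item_naive(buf):
    """Old-style reader: assumes every item uses the 64-bit layout."""
    return struct.unpack_from(FMT_64, buf)


def read_item_converting(buf, item_len):
    """Newer-style reader: choose the layout from the recorded item length."""
    if item_len == struct.calcsize(FMT_32):
        return struct.unpack_from(FMT_32, buf)
    if item_len == struct.calcsize(FMT_64):
        return struct.unpack_from(FMT_64, buf)
    raise ValueError(f"unrecognized log item length: {item_len}")


# Two items written back to back by a 32-bit machine.
log = struct.pack(FMT_32, 1, 8, 0x1000) + struct.pack(FMT_32, 2, 8, 0x2000)

naive = read_item_naive(log)               # bytes of item 2 are read as an address
converted = read_item_converting(log, 8)   # recovers the address 0x1000

print(hex(naive[2]), hex(converted[2]))
```

In the naive case the "address" is built from the bytes of the following item, which is the sort of bogus value that would fault if a kernel dereferenced it.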
(Oops. Meant to add to the bug instead of just email.) Good point, Eric. If it's an endian difference for a dirty log, then it will report it and fail to mount. If it's a word-size difference for a dirty log with an old kernel, then it almost certainly will crash with a backtrace similar to what was seen here. That shouldn't happen with recent XFS.
--Tim
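As an illustration of the "report it and fail to mount" behaviour for an endian mismatch, the following is a minimal sketch, assuming a log header that carries a byte-order tag. The tag values, the header layout, and the function names are all made up for the example and are not the actual XFS header fields.

```python
import struct
import sys

# Hypothetical one-byte format tags stored at the start of a log header.
# Illustrative only; not the actual XFS header fields.
FMT_LE = 1
FMT_BE = 2
NATIVE_FMT = FMT_LE if sys.byteorder == "little" else FMT_BE


class LogFormatError(Exception):
    """Raised when a dirty log cannot be replayed safely on this machine."""


def write_header(fmt_tag, cycle):
    """Build a header: the tag byte, then a 32-bit counter in that byte order."""
    order = "<" if fmt_tag == FMT_LE else ">"
    return bytes([fmt_tag]) + struct.pack(order + "I", cycle)


def check_and_read_header(buf, native_fmt=NATIVE_FMT):
    """Refuse to parse a log written in a foreign byte order."""
    fmt_tag = buf[0]
    if fmt_tag not in (FMT_LE, FMT_BE):
        raise LogFormatError(f"unknown log format tag: {fmt_tag}")
    if fmt_tag != native_fmt:
        raise LogFormatError("dirty log has a different byte order; not replaying")
    order = "<" if fmt_tag == FMT_LE else ">"
    (cycle,) = struct.unpack_from(order + "I", buf, 1)
    return cycle
```

The point of the check is that the reader stops with an error before interpreting any of the payload, instead of walking into misread fields.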
> Can the original reporter try with a recent kernel?

Regrettably, I can't. My AlphaStation's power supply gave out, my other AlphaStation's CPU board is giving hardware error codes, and their parts are not interchangeable.
In the absence of further testing, I think it makes sense to assume that this is the problem that has been fixed in the upstream log code.