Bug 6180
| Summary: | XFS oopses on my box sometimes | | |
|---|---|---|---|
| Product: | File System | Reporter: | Avuton Olrich (avuton) |
| Component: | XFS | Assignee: | XFS Guru (xfs-masters) |
| Status: | RESOLVED PATCH_ALREADY_AVAILABLE | | |
| Severity: | normal | | |
| Priority: | P2 | | |
| Hardware: | i386 | | |
| OS: | Linux | | |
| Kernel Version: | v2.6.16-rc5+ | Subsystem: | |
| Regression: | --- | Bisected commit-id: | |
| Attachments: | .config | | |
Description
Avuton Olrich
2006-03-07 02:18:27 UTC
Created attachment 7526: .config
| I ran fsck to fix errors on this disk in the last week,
| but it appears the problem still exists.

Could you unmount and run xfs_repair please, and capture the output? (fsck is a no-op on XFS.)

I spent some time last week on this, just didn't get a message out to you - my disassembly of the code around the point of your panic gave me no real clues - I think (not 100% sure though...) it's an xfs_inode NULL pointer deref, but my disassembly didn't have instructions quite lining up with yours (probably different gcc versions). I'm taking a stab in the dark that you may have a corrupt inode nlink field ondisk for a particular inode, but that's just a guess at this stage based on the oops.

Can you try to figure out more exactly when this started? (Is there a kernel version you've reverted back to where it doesn't occur?) If there is anything unusual in the xfs_repair output, pls add it here.

Also, is it always the xdm process (from your trace) that triggers this or does it vary? And is your stacktrace EIP always in xfs_read as in your hand-copied trace here?

thanks!

|Could you unmount and run xfs_repair please, and capture the output.
|(fsck is a no-op on XFS).

My apologies, I actually already ran xfs_repair. I've had it crash since, so I will run it again and send you info within the next 24h.

|I spent some time last week on this, just didn't get a message out to
|you - my disassembly of the code around the point of your panic gave
|me no real clues - I think (not 100% sure though...) it's an xfs_inode
|NULL pointer deref, but my disassembly didn't have instructions quite
|lining up with yours (probably different gcc versions). I'm taking a
|stab in the dark that you may have a corrupt inode nlink field ondisk
|for a particular inode, but that's just a guess at this stage based on
|the oops.

I was probably running gcc-4.0.2.

|Can you try figure out more exactly when this started? (is there a
|kernel version you've reverted back to where it doesn't occur?). If

I'm sorry, I can't say where it started or ended, I believe after 2.6.15 until now. Every time it happened I would try to upgrade git and recompile, hoping not to see the issue again. I did that probably 3 or 4 times over 2 weeks.

|there is anything unusual in the xfs_repair output, pls add it here.

I'll post the full output.

|Also, is it always the xdm process (from your trace) that triggers
|this or does it vary?

It definitely varies. I've had it happen at different times.

|And are your stacktrace EIP always in xfs_read
|as in your hand-copied trace here?

I'm sorry, I only looked at the older traces long enough to see XFS being the cause and I had to reboot to get more work done :/. The good news is it hasn't happened in the last 5 days. I will get the xfs_repair output to you asap, thanks for looking at this.

Granted it took 6 days, but the crash did reoccur. Unfortunately my netconsole didn't catch it, so here are the xfs_repair results:

#xfs_results -vvv &> xfs_results.out
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
zero_log: head block 26403 tail block 26403
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
error following ag 10 unlinked list
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - clear lost+found (if it exists) ...
        - clearing existing "lost+found" inode
        - marking entry "lost+found" to be deleted
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - ensuring existence of lost+found directory
        - traversing filesystem starting at / ...
rebuilding directory inode 128
        - traversal finished ...
        - traversing all unattached subtrees ...
        - traversals finished ...
        - moving disconnected inodes to lost+found ...
disconnected inode 101622, moving to lost+found
disconnected dir inode 102373, moving to lost+found
disconnected inode 940819, moving to lost+found
disconnected dir inode 16777652, moving to lost+found
disconnected dir inode 16777659, moving to lost+found
disconnected dir inode 16884299, moving to lost+found
disconnected dir inode 33573578, moving to lost+found
disconnected dir inode 33578133, moving to lost+found
disconnected dir inode 33705470, moving to lost+found
disconnected inode 33712740, moving to lost+found
disconnected dir inode 34072960, moving to lost+found
disconnected dir inode 50331837, moving to lost+found
disconnected dir inode 50346747, moving to lost+found
disconnected inode 50410891, moving to lost+found
disconnected inode 50423887, moving to lost+found
disconnected dir inode 50458665, moving to lost+found
disconnected dir inode 50550768, moving to lost+found
disconnected inode 50925062, moving to lost+found
disconnected inode 52335040, moving to lost+found
disconnected inode 52335041, moving to lost+found
disconnected inode 52335042, moving to lost+found
disconnected inode 52335043, moving to lost+found
disconnected inode 52418163, moving to lost+found
disconnected inode 52418166, moving to lost+found
disconnected inode 52438903, moving to lost+found
disconnected inode 52438904, moving to lost+found
disconnected inode 52438905, moving to lost+found
disconnected inode 52438906, moving to lost+found
disconnected inode 52438907, moving to lost+found
disconnected inode 52438908, moving to lost+found
disconnected dir inode 67109002, moving to lost+found
disconnected dir inode 67109030, moving to lost+found
disconnected inode 67164119, moving to lost+found
disconnected dir inode 67175491, moving to lost+found
disconnected dir inode 67191117, moving to lost+found
disconnected dir inode 67206707, moving to lost+found
disconnected dir inode 83886246, moving to lost+found
disconnected inode 83950331, moving to lost+found
disconnected dir inode 83963818, moving to lost+found
disconnected inode 84151441, moving to lost+found
disconnected dir inode 84424758, moving to lost+found
disconnected dir inode 84438505, moving to lost+found
disconnected dir inode 84569913, moving to lost+found
disconnected dir inode 100663431, moving to lost+found
disconnected dir inode 100837292, moving to lost+found
disconnected inode 102206716, moving to lost+found
disconnected inode 102206719, moving to lost+found
disconnected inode 102206734, moving to lost+found
disconnected inode 102206807, moving to lost+found
disconnected inode 102206811, moving to lost+found
disconnected inode 102207428, moving to lost+found
disconnected dir inode 102379461, moving to lost+found
disconnected inode 117472299, moving to lost+found
disconnected inode 117473359, moving to lost+found
disconnected inode 117485360, moving to lost+found
disconnected inode 117485365, moving to lost+found
disconnected inode 117485366, moving to lost+found
disconnected inode 117485367, moving to lost+found
disconnected inode 117485368, moving to lost+found
disconnected inode 117485369, moving to lost+found
disconnected inode 117485370, moving to lost+found
disconnected inode 117485371, moving to lost+found
disconnected inode 117485372, moving to lost+found
disconnected inode 117485373, moving to lost+found
disconnected dir inode 117492066, moving to lost+found
disconnected dir inode 117527679, moving to lost+found
disconnected dir inode 117533817, moving to lost+found
disconnected inode 117544364, moving to lost+found
disconnected inode 117548115, moving to lost+found
disconnected dir inode 117647435, moving to lost+found
disconnected dir inode 134300742, moving to lost+found
disconnected dir inode 134301107, moving to lost+found
disconnected dir inode 150995538, moving to lost+found
disconnected dir inode 151010925, moving to lost+found
disconnected dir inode 151070725, moving to lost+found
disconnected dir inode 151476838, moving to lost+found
disconnected dir inode 167772304, moving to lost+found
disconnected dir inode 167772437, moving to lost+found
disconnected inode 167772677, moving to lost+found
disconnected inode 167824533, moving to lost+found
disconnected dir inode 167828338, moving to lost+found
disconnected dir inode 167832595, moving to lost+found
disconnected inode 167836641, moving to lost+found
disconnected dir inode 184549558, moving to lost+found
disconnected inode 184550730, moving to lost+found
disconnected inode 184599459, moving to lost+found
disconnected inode 185581966, moving to lost+found
disconnected dir inode 201326722, moving to lost+found
disconnected dir inode 201326779, moving to lost+found
disconnected dir inode 201731800, moving to lost+found
disconnected dir inode 218103941, moving to lost+found
disconnected inode 218119733, moving to lost+found
disconnected dir inode 218129489, moving to lost+found
disconnected inode 218244421, moving to lost+found
disconnected inode 234890099, moving to lost+found
disconnected inode 234890102, moving to lost+found
disconnected inode 234941238, moving to lost+found
disconnected dir inode 251658371, moving to lost+found
disconnected dir inode 251658416, moving to lost+found
disconnected dir inode 251695387, moving to lost+found
disconnected inode 251894389, moving to lost+found
disconnected inode 251894392, moving to lost+found
disconnected inode 251894394, moving to lost+found
disconnected inode 251894395, moving to lost+found
disconnected inode 251894397, moving to lost+found
disconnected inode 251894398, moving to lost+found
disconnected inode 251894399, moving to lost+found
disconnected inode 251897728, moving to lost+found
disconnected inode 251897731, moving to lost+found
disconnected inode 251897732, moving to lost+found
disconnected inode 251897733, moving to lost+found
disconnected inode 251897734, moving to lost+found
disconnected inode 252658848, moving to lost+found
Phase 7 - verify and correct link counts...
resetting inode 102373 nlinks from 0 to 2
resetting inode 16777652 nlinks from 0 to 2
resetting inode 16777659 nlinks from 0 to 2
resetting inode 16884299 nlinks from 0 to 2
resetting inode 33573578 nlinks from 0 to 2
resetting inode 33578133 nlinks from 0 to 2
resetting inode 33705470 nlinks from 0 to 2
resetting inode 34072960 nlinks from 0 to 2
resetting inode 50331837 nlinks from 0 to 2
resetting inode 50346747 nlinks from 0 to 2
resetting inode 50458665 nlinks from 0 to 2
resetting inode 50550768 nlinks from 0 to 2
resetting inode 67109002 nlinks from 0 to 2
resetting inode 67109030 nlinks from 0 to 2
resetting inode 67175491 nlinks from 0 to 2
resetting inode 67191117 nlinks from 0 to 2
resetting inode 67206707 nlinks from 0 to 2
resetting inode 83886246 nlinks from 0 to 2
resetting inode 83963818 nlinks from 0 to 2
resetting inode 84424758 nlinks from 0 to 2
resetting inode 84438505 nlinks from 0 to 2
resetting inode 100663431 nlinks from 0 to 2
resetting inode 100837292 nlinks from 0 to 2
resetting inode 102379461 nlinks from 0 to 2
resetting inode 117492066 nlinks from 0 to 2
resetting inode 117527679 nlinks from 0 to 2
resetting inode 117533817 nlinks from 0 to 2
resetting inode 117647435 nlinks from 0 to 2
resetting inode 134300742 nlinks from 0 to 2
resetting inode 134301107 nlinks from 0 to 2
resetting inode 150995538 nlinks from 0 to 2
resetting inode 151010925 nlinks from 0 to 2
resetting inode 151070725 nlinks from 0 to 2
resetting inode 151476838 nlinks from 0 to 2
resetting inode 167772304 nlinks from 0 to 2
resetting inode 167772437 nlinks from 0 to 2
resetting inode 167828338 nlinks from 0 to 2
resetting inode 167832595 nlinks from 0 to 2
resetting inode 184549558 nlinks from 0 to 2
resetting inode 201326722 nlinks from 0 to 2
resetting inode 201326779 nlinks from 0 to 2
resetting inode 218103941 nlinks from 0 to 2
resetting inode 218129489 nlinks from 0 to 2
resetting inode 251658371 nlinks from 0 to 2
resetting inode 251658416 nlinks from 0 to 2
resetting inode 251695387 nlinks from 0 to 2
done

Well, will you look at that?

| I'm taking a
| stab in the dark that you may have a corrupt inode nlink field ondisk
| for a particular inode, but that's just a guess at this stage based on
| the oops.

Spot on - so, that's why you're getting a panic anyway, somehow one of those inodes with nlink==0 is visible through the directory hierarchy and someone is accessing it; someone else takes it away at the same time, and boom.

The real root of the problem though is how did the nlink field get that way... this would probably have happened at some point well before your panic, so we've got no real clues to go on unfortunately. :(

Hohum, so, I'm back at trying to get you to narrow down what you do to reproduce it... but I've got no good ideas on how you might do that.

Only other data points here - we can say this is quite unlikely to be a recent regression (lotsa people seem to be asking..), since I can't think of anything that's changed recently in XFS that would affect the inode link count (within XFS anyway, perhaps some VFS change is making it more likely to occur, but...). No one else seems to be hitting this though, which makes me wonder if there's something a bit unusual about your workload / filesystem accesses that's tickling this ... anything you can think of?
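The xfs_repair messages quoted above can be tallied mechanically rather than by eye. The following is a minimal sketch, assuming only the two message shapes that appear in this report ("disconnected [dir] inode N, moving to lost+found" and "resetting inode N nlinks from A to B"); the function name `summarize_repair_log` and the result keys are hypothetical, not part of xfsprogs. Whitespace is matched loosely so that it also works on log text whose line breaks were lost.

```python
import re

# Message shapes as quoted in this report; \s+ tolerates re-wrapped or flattened logs.
DISCONNECTED = re.compile(
    r"disconnected\s+(dir\s+)?inode\s+(\d+),\s+moving\s+to\s+lost\+found"
)
NLINK_RESET = re.compile(
    r"resetting\s+inode\s+(\d+)\s+nlinks\s+from\s+(\d+)\s+to\s+(\d+)"
)


def summarize_repair_log(text):
    """Collect disconnected inodes and link-count resets from xfs_repair output."""
    dirs, files = set(), set()
    for dir_marker, ino in DISCONNECTED.findall(text):
        # dir_marker is an empty string when the optional "dir" word is absent.
        (dirs if dir_marker else files).add(int(ino))
    resets = {
        int(ino): (int(old), int(new))
        for ino, old, new in NLINK_RESET.findall(text)
    }
    return {
        "disconnected_dirs": dirs,
        "disconnected_files": files,
        "nlink_resets": resets,
        # Inodes whose link count was reset without being reported as disconnected.
        "resets_not_disconnected": set(resets) - dirs - files,
    }


if __name__ == "__main__":
    sample = (
        "disconnected inode 101622, moving to lost+found\n"
        "disconnected dir inode 102373, moving to lost+found\n"
        "resetting inode 102373 nlinks from 0 to 2\n"
    )
    summary = summarize_repair_log(sample)
    print(len(summary["disconnected_dirs"]), "disconnected directories")
    print(len(summary["disconnected_files"]), "disconnected files")
    print(len(summary["nlink_resets"]), "link counts reset")
```

Feeding it the saved xfs_results.out would give counts to compare between repair runs, which may help show whether the same inodes are affected each time.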
Oh, and the "error following ag 10 unlinked list" repair message is an odd one too, I need to go think about what might cause that a bit, cos it will explain some of your unlinked inodes. I'd be interested in seeing if it happens again now it's repaired, and if so, whether it happens after a crash or unclean shutdown (i.e. no unmount), and if so, whether that same xfs_repair message gets dumped.

cheers.

|Spot on - so, that's why you're getting a panic anyway, somehow one of
|those inodes with nlink==0 is visible through the directory hierarchy
|and someone is accessing it; someone else takes it away at the same
|time, and boom.
|The real root of the problem though is how did the nlink field get that
|way... this would probably have happened at some point well before your
|panic, so we've got no real clues to go on unfortunately. :(

To be quite honest I don't think I'm really doing anything unusual, unless suspend and resume are unusual. I did xfs_repair this mount before and it did happen again afterwards. I'm fairly sure it started after 2.6.15, of course I could be mistaken. One thing I can say is there's nothing that really _reproduces_ it afaict. I have had it happen during an emerge, and have had it happen after just coming back to the computer to use it in the middle of an X session with nothing else happening. Simply put, I can name most of the normal stuff that runs on this computer: Apache2, xdm, fvwm2, konqueror, kate (editor), jasspa microemacs, emerge. Those are the most used programs. I have had it happen during high CPU usage and had it happen a lot during no CPU usage.

I don't suppose you have anything to take out the metadata, so you can research it? reiser4 has a program like this iirc.

My netconsole isn't hooked up correctly atm, so I'm going to reconfigure that, rebuild a newer git and see if I can get you a newer dump. This one will be with GCC-4.1, if that'll help you any. If you can think of anything I can do to help please let me know.
If you want I can also try going back to 2.6.15, run it for a week and see what happens. Any ideas of where to go from here?

Rather than going back to an earlier kernel, could you try a build with PREEMPT disabled and let me know if that makes a difference?

One other question - do you run the filesystem out of space very often? (do these problems happen near/after running out of space for example?)

| I don't suppose you have anything to take out the metadata, so you can
| research it? reiser4 has a program like this iirc.

That won't help here - I know what the problem with the metadata ondisk is, the issue is now figuring out how it got into that state.

cheers.

|Rather than going back to an earlier kernel, could you try a build with
|PREEMPT disabled and let me know if that makes a difference.

I will do that, though as I said, without a good week to test I can't really be sure it's not going to happen again.

|One other question - do you run the filesystem out of space very often?
|(do these problems happen near/after running out of space for example?)

Actually quite the opposite. Max disk usage is probably always about 20%, so that's definitely not the issue here.

Turned preempt off and it has crashed again; it took 2 days but it happened. It seems to always corrupt my git directory of phpMp. It usually (but not always) crashes on me when I'm physically on the computer, editing php in jasspa-microemacs, in X/fvwm2, going back and forth between that and my konqueror browser, with which I also use apache2 heavily, all on localhost not connected to the net. I'm not sure any of that is the problem, of course. I really have no idea what else I can give you to help you find the issue, unless you want me to revert to an earlier kernel version.

| unless you want me to revert to an earlier kernel version.

That sounds like the best bet at this stage - that will at least give us more confidence as to whether this is a regression from .15 or not.
Could you send me an "ls -ali" of that directory that always seems to be affected too please?

thanks.

|Could you send me an "ls -ali" of that directory that always seems to be
|affected too please?

sbh@micromachine ~/public_html/phpMp $ ls -ali
total 220
168673823 drwxr-xr-x 5 sbh users 4096 Mar 19 18:24 .
83931886 drwxr-xr-x 6 sbh users 125 Mar 19 16:06 ..
184753491 drwxr-xr-x 7 sbh users 123 Mar 18 18:08 .git
168673824 -rw-r--r-- 1 sbh users 17992 Mar 18 18:08 COPYING
168673825 -rw-r--r-- 1 sbh users 8031 Mar 18 18:08 ChangeLog
168673826 -rw-r--r-- 1 sbh users 1276 Mar 18 18:08 INSTALL
168673827 -rw-r--r-- 1 sbh users 2151 Mar 18 18:08 README
168673828 -rw-r--r-- 1 sbh users 603 Mar 18 18:08 TODO
84169490 drwxr-xr-x 2 apache users 66 Mar 19 14:43 cache
168673795 -rw-r--r-- 1 sbh users 10851 Mar 19 18:24 config.php
168673793 -rw-r--r-- 1 sbh users 10835 Mar 19 18:19 config.php~
101256705 drwxr-xr-x 2 sbh users 25 Mar 18 18:08 contrib
168673792 -rw-r--r-- 1 sbh users 19500 Mar 19 18:24 features.php
168673794 -rw-r--r-- 1 sbh users 19503 Mar 19 18:24 features.php~
168673831 -rw-r--r-- 1 sbh users 4643 Mar 18 18:08 index.inc
168673832 -rw-r--r-- 1 sbh users 9275 Mar 18 18:08 index.php
168673833 -rw-r--r-- 1 sbh users 22625 Mar 18 18:08 main.php
168673834 -rw-r--r-- 1 sbh users 3638 Mar 18 18:08 mpd-favicon.ico
168673835 -rw-r--r-- 1 sbh users 11919 Mar 18 18:08 mtable.inc
168673836 -rw-r--r-- 1 sbh users 27178 Mar 18 18:08 playlist.php
168673837 -rw-r--r-- 1 sbh users 1626 Mar 18 18:08 sort.php
168673838 -rw-r--r-- 1 sbh users 6207 Mar 18 18:08 theme.php
168673839 -rw-r--r-- 1 sbh users 832 Mar 18 18:08 transparent.gif
168673840 -rw-r--r-- 1 sbh users 5176 Mar 18 18:08 xml-parse.php

Hrm, on 2.6.15.6 for 4 days so far and no signs of wanting to crash.
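The inode number and link count columns that "ls -ali" prints can also be collected programmatically, which makes it easier to spot an entry that is reachable by name yet reports a link count of 0 (the condition discussed earlier in this thread). This is a minimal sketch using Python's standard `os.lstat`; the helper names `list_links` and `zero_link_entries` are hypothetical, and whether a corrupted on-disk nlink would actually surface through stat on a given kernel is an assumption, not something this report confirms. Unlike "ls -a", `os.listdir` does not return "." and "..".

```python
import os
import stat


def list_links(directory):
    """Return (inode, mode string, link count, name) per entry, similar to `ls -ali`."""
    rows = []
    for name in sorted(os.listdir(directory)):
        # lstat reports on the entry itself and does not follow symlinks.
        st = os.lstat(os.path.join(directory, name))
        rows.append((st.st_ino, stat.filemode(st.st_mode), st.st_nlink, name))
    return rows


def zero_link_entries(rows):
    """Entries that are visible in the directory but report a link count of 0."""
    return [row for row in rows if row[2] == 0]


if __name__ == "__main__":
    rows = list_links(".")
    for ino, mode, nlink, name in rows:
        print(f"{ino:>10} {mode} {nlink:>3} {name}")
    for row in zero_link_entries(rows):
        print("suspicious (nlink == 0):", row[3])
```

On a healthy filesystem the second loop should print nothing; any hit would be worth comparing against the inode numbers in the xfs_repair output.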
Thanks Avuton,

I think the next step to understanding this that we'll need to take here is to get it nailed down to a small set of changes - I understand git bisect is the tool of choice for doing this kind of fault isolation. I know that will take a while, given the time-between-failures, but there are not many other options at this stage. :|

I'd really like to get to the bottom of this though, so if you could do that we'd really appreciate it.

thanks.

OK, I will look into doing this, and as you said it will take a while, but I will continue until I get something conclusive. Thanks!

OK, I'm not really sure when it got fixed, but I updated to 2.6.17-rc1 and things seem fine (after a week).

Seems fixed, will reopen if the problem presents itself again. Thanks for the help.

I possibly have a related problem: http://bugzilla.kernel.org/show_bug.cgi?id=6380
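On the git bisect suggestion made earlier in this thread: a rough sketch of why it "will take a while" with a failure that needs days to show up. Bisection roughly halves the candidate range per test, so the number of kernels to build and run grows with the base-2 logarithm of the commit count. The commit count (5000) and the one-week soak time per kernel below are hypothetical figures chosen for illustration, not numbers taken from this report or from the kernel history, and real histories with merges only approximate this bound.

```python
def bisect_steps(candidate_commits):
    """Approximate worst-case number of test rounds to isolate one commit.

    Each round halves the candidate range, so this is ceil(log2(n)),
    computed with integer arithmetic to avoid floating-point rounding.
    """
    if candidate_commits < 1:
        raise ValueError("need at least one candidate commit")
    return (candidate_commits - 1).bit_length()


def bisect_days(candidate_commits, days_per_test):
    """Total wall-clock days if every round needs days_per_test of soak time."""
    return bisect_steps(candidate_commits) * days_per_test


if __name__ == "__main__":
    commits = 5000      # hypothetical size of the suspect range
    soak_days = 7       # hypothetical: a week without a crash counts as "good"
    print(bisect_steps(commits), "test rounds")
    print(bisect_days(commits, soak_days), "days in total")
```

The soak time dominates: halving the commit range saves only one round, while a faster reproducer would shorten every round.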