this bug is with intel i915. system becomes _very_ sluggish once this fires off. the original thread is here..: https://lkml.org/lkml/2012/10/23/498 attached is intel_error_decode which has intel_error_state and dmesg of the last crash. I will start a bisect to see if I get anywhere on this(might be a few weeks since this is a slow trigger)will post if I get anywhere.
Created attachment 84921 [details] dmesg of crash dmesg of crash.
such a pain.. 3.4 mbs is to big for bugzilla here is the link to a filesharing site with intel_error_decode..: http://www.fileserve.com/file/FYdxvFC/intel_error_decode just scroll to the middle of the page, wait for the timer, do the captcha then download. ignore all the other crap(since _all_ these sites do this crap with popups etc..)
Ok, I'll guess we'll move it back to the mailing list again. Btw, I've asked you to report the bug on bugs.freedesktop.org since that bugzilla does not have a stupid size limitation for attachments and happily handles an i915_error_state ;-)
just for the record please dont close this.. I am still bisecting this. about half way through the bisect. hopefully I make sure I get the proper commits down(good/bad) so I dont go astray with the bisect. from looking at it.. its between: 3.6.0-00722-gbd0d104 and 3.6.0-06942-g11126c6 I have about 200 revisions to test.. on this before I can get an idea.
okay.. still bisecting a few more commits, but not sure if I need to do so, given that the rest of the commits are with kernel-doc and fat. image is here through gitk: http://i1341.photobucket.com/albums/o745/justinmattock/bisect.png the bisect log is here: (3 commits left) http://fpaste.org/JBuh/ from now I should be running a good kernel(according to the bisect): commit 65478428 but might go and finish the bisect early so I can start running commit 66b8ef677 (revert) will post what I find out..
seems there might already be a fix upstream for this..: 5b7aa7d5b will run the bisect for a few more days, then can pull to the Mainline to test this to see if I crash and burn. seems after three days I get the crash, so whatever (parent/child) process is firing off is getting denied by this commit(or old commit now)
My Dell Latitude E6500 has this problem. I can test any patches or maybe do a bisect if needed.
what kernel version are you on? over here the bisect is pointing to: [66b8ef67756b3051bf42a077a82c3c5c279caa5b] device_cgroup: add "deny_all" in dev_cgroup structure keep in mind looking through the history of device_cgroup.c I am noticing lots of changes with deny_all commit so the current Mainline could already be fixed. with that in mind will continue bisecting.. once finished will pull the current Mainline to see if I hit this, then will go from there. NOTE:during the bisect I am getting consistent crash at around 3 days of being up(uptime) but other machines could be different
This has been happening on every 3.7-RC I've tried. It occurs within minutes of booting up and I have to give up and reboot into 3.6.x for a usable desktop. The main activity on fixing this bug appears to be here. I do use the rc6 boot flag and haven't tried disabling it. https://bugs.freedesktop.org/show_bug.cgi?id=55984
okay so doing a bisect on your part should not take so long since you can easily hit this in my case was 3 days before firing off, causing the bisect to last over a month now to get proper results(hopefully). if you can.. try the bisect to see if you get the same results since its easily reproduceable. right now im testing a commit and wont reboot unless I hit the bug, or after 5 days of running this revision(only way to really test this from here) then if good I will pull to the Mainline and run that to see if all the device_cgroup fixes are fixing this, if not, then I revert the cgroup stuff on the Mainline to verify its _really_ the cause of this crash..
Tested 3.7.0-RC8 with device_cgroup.c changes reverted back to your bisect revision. I've been unable to reproduce the crash so far. I will post again if it still does it.
yeah this is so we can verify that the cgroup deny_all policy is blocking a process causing the GPU to crap out. kind of like SELinux, having one AVC's denial can keep X11 from starting when in enforcing mode, but in this case(reading through the code) its between parent/child processes mostly the child process is being denied. Im guessing probably the reason for different times on different machines for when this crash occurs. but keep in mind I dont know much about cgroup to really give a definitive answer. main thing now is finding the culprit(commit) so atleast we can get a handle on this bug good sign over here: uptime: 13:23pm up 3 days 14:24, 4 users, load average: 0.20, 0.14, 0.14 the longest on a bad kernel was 3 days 04:00 thanks for testing/helping on this comment #11
Sorry, bad news. My desktop just froze briefly and logged the error. Took almost 90 minutes this time.
and this is before the cgroup stuff? if so then my bisect is astray.. maybe you can bisect this since you have a more stable crash than I on this?
This is the diff that I backed out of 3.7.0-RC8. http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blobdiff_plain;f=security/device_cgroup.c;h=b08d20c66c2e17e0fa8333984df8eb4a82595e0b;hp=e3ce02a00ffcdadc1ee202bee3bfde6a2c3f5198;hb=HEAD;hpb=66b8ef67756b3051bf42a077a82c3c5c279caa5b I can start a bisect sometime this week.
okay.. so you are on a kernel before commit: 66b8ef6775 over here I am on commit: 65478428443 in of which has not crashed the kernel. I am going ahead with my plan to finish this bisect, then will pull to the current to verify the Mainline. I can do another bisect, but not until I finish with these other tests of verifying that the Mainline is fixed or not. as for you, bisecting would be great since you can easily reproduce this.
This is where I am on my bisect. Might have it finished tonight or tomorrow. Bisecting: 14 revisions left to test after this (roughly 4 steps) [0327d6ba998ca181013a5a1709701a6532a41972] drm/i915: Extract general object init routine git bisect start # bad: [a0d271cbfed1dd50278c6b06bead3d00ba0a88f9] Linux 3.6 git bisect bad a0d271cbfed1dd50278c6b06bead3d00ba0a88f9 # good: [a0d271cbfed1dd50278c6b06bead3d00ba0a88f9] Linux 3.6 git bisect good a0d271cbfed1dd50278c6b06bead3d00ba0a88f9 # good: [a0d271cbfed1dd50278c6b06bead3d00ba0a88f9] Linux 3.6 git bisect good a0d271cbfed1dd50278c6b06bead3d00ba0a88f9 # bad: [ddffeb8c4d0331609ef2581d84de4d763607bd37] Linux 3.7-rc1 git bisect bad ddffeb8c4d0331609ef2581d84de4d763607bd37 # good: [24d7b40a60cf19008334bcbcbd98da374d4d9c64] ARM: OMAP2+: PM: MPU DVFS: use generic CPU device for MPU-SS git bisect good 24d7b40a60cf19008334bcbcbd98da374d4d9c64 # bad: [21c8715f0a1f4df8bfa2bd6f3915e5e33c1c2e6e] ARM: integrator: use __iomem pointers for MMIO, part 2 git bisect bad 21c8715f0a1f4df8bfa2bd6f3915e5e33c1c2e6e # bad: [33e2a4227ddff7c18921ac175fae3ab0e3ff8a76] lib/decompress.c add __init to decompress_method and data git bisect bad 33e2a4227ddff7c18921ac175fae3ab0e3ff8a76 # good: [3a494318b14b1bc0f59d2d6ce84c505c74d82d2a] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace git bisect good 3a494318b14b1bc0f59d2d6ce84c505c74d82d2a # bad: [268d28371cd326be4dfcd7eba5917bf4b9d30c8f] Merge branch 'drm-nouveau-next' of git://anongit.freedesktop.org/git/nouveau/linux-2.6 into drm-next git bisect bad 268d28371cd326be4dfcd7eba5917bf4b9d30c8f # skip: [8c0bd3c02d52eff11396e81b4d217ee668e03528] drm/i915: placeholder getparam git bisect skip 8c0bd3c02d52eff11396e81b4d217ee668e03528 # bad: [37e680a15ff6375ff02773161f817e90a38c51f7] drm/i915: Introduce drm_i915_gem_object_ops git bisect bad 37e680a15ff6375ff02773161f817e90a38c51f7 # good: [1aa4b628ee63f55db96c7e820257b6e4948abb74] drm/i915: don't save all the encoder/crtc state in set_config git bisect good 1aa4b628ee63f55db96c7e820257b6e4948abb74 # bad: [4772eaebcdf86dd65630339dbe58316b90f80aed] drm/i915: add gen7_render_ring_flush git bisect bad 4772eaebcdf86dd65630339dbe58316b90f80aed # good: [ce5fdd63752390b89804b1e246de267b398d1515] gma500: fix incorrect declaration. git bisect good ce5fdd63752390b89804b1e246de267b398d1515
One more. Bisecting: 6 revisions left to test after this (roughly 3 steps) [de9932d13c95c4125d2ed8703f5574fcc108bbcf] drm/i915: fix reassignment of variable "intel_dp->DP" I was running glxgears and other video and disk intensive processes when it happened and this was in the console. Not sure if that is useful. intel_do_flush_locked failed: Input/output error
Should have this done tomorrow. Current build is stable so far. Bisecting: 2 revisions left to test after this (roughly 2 steps) [b7abb7144356536c41a81eda2bbfd46027e529f6] drm/i915: Show (count, size) of purgeable objects in i915_gem_objects Bisecting: 1 revision left to test after this (roughly 1 step) [6c085a728cf000ac1865d66f8c9b52935558b328] drm/i915: Track unbound pages # bad: [0327d6ba998ca181013a5a1709701a6532a41972] drm/i915: Extract general object init routine git bisect bad 0327d6ba998ca181013a5a1709701a6532a41972 # bad: [de9932d13c95c4125d2ed8703f5574fcc108bbcf] drm/i915: fix reassignment of variable "intel_dp->DP" git bisect bad de9932d13c95c4125d2ed8703f5574fcc108bbcf # bad: [b7abb7144356536c41a81eda2bbfd46027e529f6] drm/i915: Show (count, size) of purgeable objects in i915_gem_objects git bisect bad b7abb7144356536c41a81eda2bbfd46027e529f6
well.. much faster you doing this than waiting a month for any result.. looking at the bisect logs seems you are in the right direction(but you never know). let me know what you find out. over here running the current Mainline for a few days to see if I hit this. bisect is finished, will post the results for the record. Thanks for doing the bisect on this!
Here's the rest of the bisect. Will attach full bisect log. Bisecting: 0 revisions left to test after this (roughly 0 steps) [225067eedf1f4d4065940232c9069fcb255206ee] drm/i915: move functions around 6c085a728cf000ac1865d66f8c9b52935558b328 is the first bad commit commit 6c085a728cf000ac1865d66f8c9b52935558b328 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Mon Aug 20 11:40:46 2012 +0200 drm/i915: Track unbound pages
Created attachment 88651 [details] Full bisect log One thing confuses me. I tried 3.7-rc1 and it had the crash, so my bisect was between 3.6.0 and 3.1-rc1. But part way through the bisect, I had an unbootable kernel that couldn't mount the partition, so I had to do a skip. After that, it started including rc3 and then rc2. So it's possible that my bisect is not right, or that there are two patches that cause the problem.
On Sat, Dec 8, 2012 at 5:49 PM, <bugzilla-daemon@bugzilla.kernel.org> wrote: > --- Comment #22 from Brad Jackson <bjackson0971@gmail.com> 2012-12-08 > 16:49:48 --- > Created an attachment (id=88651) > --> (https://bugzilla.kernel.org/attachment.cgi?id=88651) > Full bisect log > > One thing confuses me. I tried 3.7-rc1 and it had the crash, so my bisect was > between 3.6.0 and 3.1-rc1. But part way through the bisect, I had an > unbootable > kernel that couldn't mount the partition, so I had to do a skip. After that, > it > started including rc3 and then rc2. So it's possible that my bisect is not > right, or that there are two patches that cause the problem. That's completely expected - the kernel version and also the one git shows you just goes backwards in history until it finds the first tag and uses that. Which means if the drm-intel-next tree for 3.7 is e.g. based of 3.6-rc2, and your bisect is only in there (since all the other branches merged into 3.7-rc1 seem to work) git lets you test just commits on that branch. It looks like git bisect went back in time before the 3.6 release, but in reality the commits your testing with that 3.6-rc2 kernel have never been in 3.6, so really need to be tested. You're not the first to get totally confused by this though ;-)
yeah bisecting can be linear that way.. one minute its at 3.6 then its at 3.2 then back to 3.6 main thing is getting the proper good/bad so it does not go astray. in your case that might have occurred when you had to skip a revision because of brokenness(hopefully not though). over here still running the Mainline with no signs of breakage, but still have another day and a half to run since I seem to be hitting this at/around 3 days. will keep you updated
Created attachment 88891 [details] bisect_log_i915_hangcheck_timer_elapsed the following attachment is the bisect log that I did for this bug. at the moment I have been running the Mainline kernel without hitting this bug. normally at/around 3days I would reliably hit this and as it stands my uptime is: 19:58pm up 3 days 22:39, 6 users, load average: 0.21, 0.12, 0.14 which is a good sign that this bug is fixed. but will run this kernel for 5 days to really make sure. will post and update if this fires off.
I just got the hang on a 3.7.0 release kernel after a couple hours of uptime.
over here, still running the Mainline without hitting this.. will continue to run to see if it fires off. as for comment#26 maybe try another go at the bisect, you may need to even perform a git rebase after the initial bisect to perform a bisect in the merge to find the culprit. also since you can trigger this in a short amount of time, you may try: CONFIG_CGROUPS = n before doing any bisects to verify that the cgroup commits are causing this or not.
If I had a patch to cleanly revert 6c085a728cf000ac1865d66f8c9b52935558b328 from 3.7.0, I could see if that fixes the problem. I tried to revert it against 3.7.0-rc8, but there were errors.
downstream at launchpad: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1093651?comments=all
Smells like yet another variation of the 3.7 gpu hang regression. Please attach a dmesg with drm.debug=0xe (I've lost track of your hw) and test the following two kernel patches (alone or in combination, we're still trying to figure out what's going on): https://bugs.freedesktop.org/attachment.cgi?id=72022 https://bugs.freedesktop.org/attachment.cgi?id=71805
Presumably fixed with commit 262b6d363fcff16359c93bd58c297f961f6e6273 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Jan 15 16:17:54 2013 +0000 drm/i915: Invalidate the relocation presumed_offsets along the slow path and paper over by commit 93927ca52a55c23e0a6a305e7e9082e8411ac9fa Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Thu Jan 10 18:03:00 2013 +0100 drm/i915: Revert shrinker changes from "Track unbound pages" and commit 901593f2bf221659a605bdc1dcb11376ea934163 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed Dec 19 16:51:06 2012 +0000 drm: Only evict the blocks required to create the requested hole Thanks a lot for reporting this regression and please reopen if you're still experiencing issues. Note that the first commit is only available in drm-intel-fixes for now.