Bug 12861

Summary: Xorg fails to start "Failed to allocate space for kernel memory manager"
Product: Drivers
Reporter: Emil Karlson (jekarlson)
Component: Video(DRI - non Intel)
Assignee: Jesse Barnes (jbarnes)
Status: CLOSED INVALID
Severity: normal
CC: airlied, eric, jbarnes, rjw, serge
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.29-rc5
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 12398
Attachments: dmesg from 2.6.29-rc7
Xorg log
strace output
linux config from the first broken commit
strace output for good 2.6.29-rc4-wl (wireless testing)
xorg strace on libdrm-2.4.5 (broken 2.6.29-rc8)

Description Emil Karlson 2009-03-12 12:06:31 UTC
Distribution: Gentoo
Hardware Environment: macbook revision 2, intel gma950
Software Environment: xf86-video-intel-2.6.2 libdrm-2.4.4, gcc-4.3.3
Problem Description: Xorg fails to start

Sorry if I miscategorized this.

I get 

(EE) intel(0): Failed to allocate space for kernel memory manager
(==) intel(0): VideoRam: 16124 KB
(II) intel(0): Attempting memory allocation with tiled buffers.
(EE) intel(0): Failed to allocate framebuffer. Is your VideoRAM set too low?
(II) intel(0): Tiled allocation failed.
(WW) intel(0): Couldn't allocate tiled memory, fb compression disabled
(II) intel(0): Attempting memory allocation with untiled buffers.
(WW) intel(0): Failed to allocate EXA offscreen memory.
(II) intel(0): Untiled allocation failed.
(II) intel(0): Couldn't allocate 3D memory, disabling DRI.
(II) intel(0): Attempting memory allocation with untiled buffers.
(WW) intel(0): Failed to allocate EXA offscreen memory.
(II) intel(0): Untiled allocation failed.
(EE) intel(0): Couldn't allocate video memory

Unless I botched the bisect, this one is the bad commit. It doesn't make sense to me, but I'm not a kernel developer.

fb5ae64fdde29236e1a15e0366946df7060f41f2 is first bad commit
commit fb5ae64fdde29236e1a15e0366946df7060f41f2
Author: Serge E. Hallyn <serue@us.ibm.com>
Date:   Fri Feb 13 14:04:21 2009 +0000

    User namespaces: Only put the userns when we unhash the uid
    
    uids in namespaces other than init don't get a sysfs entry.
    
    For those in the init namespace, while we're waiting to remove
    the sysfs entry for the uid the uid is still hashed, and
    alloc_uid() may re-grab that uid without getting a new
    reference to the user_ns, which we've already put in free_user
    before scheduling remove_user_sysfs_dir().
    
    Reported-and-tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
    Acked-by: David Howells <dhowells@redhat.com>
    Tested-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Comment 1 Emil Karlson 2009-03-12 12:08:36 UTC
Created attachment 20507 [details]
dmesg from 2.6.29-rc7
Comment 2 Emil Karlson 2009-03-12 12:09:39 UTC
Created attachment 20508 [details]
Xorg log
Comment 3 Serge Hallyn 2009-03-14 16:21:38 UTC
Thanks for bisecting.  The result is a bit perplexing...  do you get any other meaningful information in /var/log/syslog or /var/log/messages?  Can you try running X under strace (maybe strace -f -o xoutput startx)?

Also, especially if this isn't 100% reproducible, it might be worth going to the previous commit and booting with that a few times to make sure this is the bad commit.
Comment 4 Emil Karlson 2009-03-14 17:18:26 UTC
Bug is still present in 2.6.29-rc8

I can actually get X to start if I set i915.modeset=1, but not in a useable state (no cursor and no keyboard)

I did explicitly confirm bisect, on second run oldconfig asked about dma remapping (set to N)

kant ~ # diff -u /boot/config /boot/config.old
-# CONFIG_DMAR_DEFAULT_ON is not set

With 99cbb86180bccd77f331f6e8eb7ce26aeea2cb72 I got normal operation, the other reproduces the bug.

  491  git checkout 99cbb86180bccd77f331f6e8eb7ce26aeea2cb72
  492  cp /boot/config-2.6.29-rc4 .config
  493  make oldconfig
  494  make -j3 all && make install modules_install && make clean mrproper && init 6
  495  cd /usr/src/linux-2.6
  496  git checkout fb5ae64fdde29236e1a15e0366946df7060f41f2
  497  make oldconfig
  498  make -j3 all && make install modules_install
Comment 5 Emil Karlson 2009-03-14 17:20:33 UTC
Created attachment 20525 [details]
strace output

Sorry no other useful logs available.
Comment 6 Serge Hallyn 2009-03-14 21:11:21 UTC
Please attach your .config.
Comment 7 Emil Karlson 2009-03-15 04:43:13 UTC
Created attachment 20529 [details]
linux config from the first broken commit
Comment 8 Serge Hallyn 2009-03-15 07:54:45 UTC
Thanks, so both cgroups (and therefore config_user_sched) and user namespaces are turned off.  Meaning put_user_ns(), which is the only thing moved in this commit, is an empty static inline function.  Suspicious :)

Sorry I had missed the fact that you'd appended strace output last night.  Could you also upload the result of a successful strace?

thanks,
-serge
Comment 9 Emil Karlson 2009-03-15 11:15:31 UTC
Created attachment 20533 [details]
strace output for good 2.6.29-rc4-wl (wireless testing)

This may be a bit different since I upgraded to libdrm-2.4.5 and ~-intel-2.6.3 ...

Which btw brings up only black screen on UXA when starting X
[drm:i915_setparam] *ERROR* unknown parameter 4
[drm:i915_getparam] *ERROR* Unknown parameter 6
[drm:i915_getparam] *ERROR* Unknown parameter 6

Probably irrelevant, ask if you want matching strace on broken revision.
Comment 10 Emil Karlson 2009-03-15 12:08:51 UTC
Created attachment 20534 [details]
xorg strace on libdrm-2.4.5 (broken 2.6.29-rc8)

Well generated one anyways...
Comment 11 Serge Hallyn 2009-03-15 12:42:39 UTC
Yeah, please do give a new bad strace output.  I don't know enough about X to know which differences are relevant.  Based on what I'm seeing so far, though, it looks as though the key lines would be:

3491  open("/dev/agpgart", O_RDWR)      = 10
3491  ioctl(10, AGPIOC_ACQUIRE or APM_IOC_STANDBY, 0) = -1 EBUSY (Device or resource busy)

According to drivers/char/agp/frontend.c this means that agp_fe.current_controller is already set or &agp_bridge->agp_in_use is not 0.  Is X already running on some other console when it fails???
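
For anyone wanting to pull the same lines out of their own trace, here is a minimal sketch. The function name agp_failures is made up for illustration, and it assumes the trace is in strace's default text format (for example, written with strace -f -o xoutput startx as suggested in comment 3).

```shell
# agp_failures FILE
# Hypothetical helper: print every AGP ioctl in an strace log that returned -1.
# Assumes strace's default "call(args) = -1 ERRNO (text)" line format.
agp_failures() {
    grep -E 'ioctl\([0-9]+, AGPIOC_[A-Z_]+' "$1" | grep -E '= -1 E[A-Z]+'
}

# Usage (xoutput is the trace file name used in comment 3):
#   agp_failures xoutput
```

It only filters lines; which of the reported failures actually matter still has to be judged by hand.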
Comment 12 Emil Karlson 2009-03-15 13:10:35 UTC
No other X processes were running.

startx outputs btw
(EE) GARTInit: AGPIOC_INFO failed (Invalid argument)
(EE) intel(0): Failed to allocate space for kernel memory manager
(EE) intel(0): Failed to allocate framebuffer. Is your VideoRAM set too low?
(EE) intel(0): Couldn't allocate video memory

Already attached new strace.
Comment 13 Serge Hallyn 2009-03-15 14:04:26 UTC
Right.  AGPIOC_INFO returns -EINVAL only if agp_fe.current_controller==NULL.  AGPIOC_ACQUIRE returns -EBUSY only if agp_fe.current_controller!=NULL or &agp_bridge->agp_in_use!=0.  So agp_in_use is not 0.
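
The deduction can be written out as a small decision function. This is only a sketch of the two rules stated in this comment, not a reading of frontend.c itself; agp_infer and its output strings are hypothetical names chosen for illustration.

```shell
# agp_infer INFO_ERRNO ACQUIRE_ERRNO
# Hypothetical helper encoding only the two rules stated in this comment:
#   AGPIOC_INFO    returns EINVAL only if current_controller == NULL
#   AGPIOC_ACQUIRE returns EBUSY  only if current_controller != NULL
#                                 or agp_in_use != 0
agp_infer() {
    info_err=$1
    acquire_err=$2
    if [ "$acquire_err" != EBUSY ]; then
        echo "no conclusion: ACQUIRE did not return EBUSY"
    elif [ "$info_err" = EINVAL ]; then
        # The controller is NULL, so agp_in_use is the only remaining cause of EBUSY.
        echo "agp_in_use != 0"
    else
        echo "current_controller != NULL or agp_in_use != 0"
    fi
}

# Usage, with the errnos seen in this report:
#   agp_infer EINVAL EBUSY
```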

We need someone who knows about X and agp to tell us under what conditions that can happen.  I don't know who that is.

One more question - does the 'No AGP bridge found' early in dmesg also show up when you boot with a kernel that works?
Comment 14 Emil Karlson 2009-03-15 14:27:20 UTC
Yes, "No AGP bridge found" can also be found in dmesg on the working kernel 2.6.29-rc4-wl.
Comment 15 Serge Hallyn 2009-03-16 07:47:44 UTC
Hi David,

could you look over the comments in this post and tell me if it rings any bells about what sorts of situations would cause the case where, when running startx, agp_bridge->agp_in_use > 0?

The kernel bisect results claim that this started happening as of a patch (which I wrote) which, in the test machine, moves some calls to a static inline empty function (put_user_ns) around.
Comment 16 Emil Karlson 2009-03-16 13:22:05 UTC
You can always question the bisect.

Still haven't seen anything newer than 99cbb86180bccd77f331f6e8eb7ce26aeea2cb72 (which I retested) work properly.

I also find it curious that oldconfig asks about dma remapping being on by default, even though there is nothing new in the kernel configuration system between these 2 revisions afaik. This particular trait seems to correlate with xorg being able to start.

Perhaps I just suck at git, still I am quite confident that there actually is a bug ;) - without git the regression is limited to rc4...rc5.

Also tried not compiling alsa and not compiling dma remapping as well as setting it on as default, none of which has any effect.
Comment 17 Serge Hallyn 2009-03-16 14:48:50 UTC
Sorry - are you saying that without git, rc6 works fine?
Comment 18 Emil Karlson 2009-03-16 15:05:09 UTC
I was not saying anything radically new or relevant, anything >=rc5 fails, sorry for the poor phrasing.
Comment 19 Serge Hallyn 2009-03-17 08:34:00 UTC
(In reply to comment #4)
> Bug is still present in 2.6.29-rc8
> 
> I can actually get X to start if I set i915.modeset=1, but not in a useable
> state (no cursor and no keyboard)
> 
> I did explicitly confirm bisect, on second run oldconfig asked about dma
> remapping (set to N)
> 
> kant ~ # diff -u /boot/config /boot/config.old
> -# CONFIG_DMAR_DEFAULT_ON is not set
> 
> With 99cbb86180bccd77f331f6e8eb7ce26aeea2cb72 I got normal operation, the
> other
> reproduces the bug.
> 
>   491  git checkout 99cbb86180bccd77f331f6e8eb7ce26aeea2cb72

Wait, what is this one?

Was this one successful?  Because this commit is from Mar 13 and is actually
not applied in Linus' main branch (IIUC).

>   492  cp /boot/config-2.6.29-rc4 .config
>   493  make oldconfig
>   494  make -j3 all && make install modules_install && make clean mrproper &&
> init 6
>   495  cd /usr/src/linux-2.6
>   496  git checkout fb5ae64fdde29236e1a15e0366946df7060f41f2

You also didn't re-copy /boot/config-2.6.29-rc4 to .config between the two builds.  Which could be a problem precisely because the first tree you compiled was a month newer than the second one.

So I think you want to try the following to really confirm whether X starts failing at the commit being blamed:

make clean && make distclean && make mrproper   # clean this bad boy out for real

git checkout 1d7b33f77b2d8b0b1ee767e6f8f05cbd9d72cb7c
cp /boot/config-2.6.29-rc4 .config
make oldconfig && make -j3 all && make install modules_install && init 6
[ confirm that X works ]

git checkout fb5ae64fdde29236e1a15e0366946df7060f41f2
cp /boot/config-2.6.29-rc4 .config
make oldconfig && make -j3 all && make install modules_install && init 6
[ confirm that X does not work ]
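
The point of the sequence above is that every build starts from the same baseline config. A minimal sketch of a wrapper that enforces this follows; build_at, run_or_print and the DRY_RUN variable are hypothetical names, and the default config path is simply the one used in this report. The reboot (init 6) is left out on purpose so it stays a manual step.

```shell
# run_or_print CMD...
# Hypothetical helper: with DRY_RUN=1, print the command instead of running it.
run_or_print() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# build_at COMMIT [CONFIG]
# Check out COMMIT, restore the baseline config, then build and install.
# Must be run from the top of the kernel tree.
build_at() {
    commit=$1
    config=${2:-/boot/config-2.6.29-rc4}
    run_or_print git checkout "$commit" &&
    run_or_print cp "$config" .config &&
    run_or_print make oldconfig &&
    run_or_print make -j3 all &&
    run_or_print make install modules_install
}

# Usage: review the command sequence first, then run it for real.
#   DRY_RUN=1; build_at fb5ae64fdde29236e1a15e0366946df7060f41f2
#   DRY_RUN=0; build_at fb5ae64fdde29236e1a15e0366946df7060f41f2
```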
Comment 20 Serge Hallyn 2009-03-17 08:38:53 UTC
Sorry, the first commit id should be:

14fa43f53ff3a9c3d8b9662574b7369812a31a97

not

1d7b33f77b2d8b0b1ee767e6f8f05cbd9d72cb7c
Comment 21 Serge Hallyn 2009-03-17 08:41:34 UTC
In fact, it might not hurt to start in a fresh directory doing

git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git

and testing there.
Comment 22 Serge Hallyn 2009-03-17 08:49:06 UTC
One more note - if you do a:

git diff 99cbb86180bccd77f331f6e8eb7ce26aeea2cb72  fb5ae64fdde29236e1a15e0366946df7060f41f2

you see there are in fact a lot of changes, including agp and drm related changes.
Comment 23 Emil Karlson 2009-03-18 12:03:18 UTC
Ok did monte carlo on new bisect:

ab657db12d7020629f26f30d287558a8d0e32b41 is first bad commit
commit ab657db12d7020629f26f30d287558a8d0e32b41
Author: Eric Anholt <eric@anholt.net>
Date:   Fri Jan 23 12:57:47 2009 -0800

    drm/i915: Set up an MTRR covering the GTT at driver load.
    
    We'd love to just be using PAT, but even on chips with PAT it gets disabled
    sometimes due to an errata.  It would probably be better to have pat_enabled
    exported and only bother with this when !pat_enabled.
    
    Signed-off-by: Eric Anholt <eric@anholt.net>
    Signed-off-by: Dave Airlie <airlied@linux.ie>

:040000 040000 38f014e90bbc5787444f7be042af496b5ffc4048 ac7d4a1c36531cb76bdff3f32fefbbf918ec89e5 M	drivers

I am still not comfortable with getting rc3 from uname -a; I wonder why.

Isn't 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=99cbb86180bccd77f331f6e8eb7ce26aeea2cb72
a proper commit - listed in
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=shortlog;h=v2.6.29-rc5
just before the commit blamed earlier...
Comment 24 Emil Karlson 2009-03-18 12:08:22 UTC
CONFIG_X86_PAT=y
In case you are wondering.

Needless to say, I used a fresh clone from Linus's tree.
build command 
  479  cd /usr/src/linux-2.6/ && cp /boot/config-2.6.29-rc4 .config && make oldconfig && make -j3 all && make install modules_install && make clean mrproper distclean && init 6
Comment 25 Serge Hallyn 2009-03-21 15:15:53 UTC
Eric Anholt, Dave Airlie, can you take a quick look at this regression introduced by ab657db12d7020629f26f30d287558a8d0e32b41 ?
Comment 26 Rafael J. Wysocki 2009-04-07 21:06:56 UTC
On Tuesday 07 April 2009, John Emil Karlson wrote:
> Yes the bug is still present in 2.6.29.1.
> 
> Venkatesh Pallipadi fixed an apparently similar bug there. I could ask him 
> for advice once I have more time.
> 
>
> http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.29.y.git;a=commit;h=e1b427acc979431fc7f57a06d0c636c542fdffcc
Comment 27 Rafael J. Wysocki 2009-04-26 11:15:40 UTC
First-Bad-Commit : ab657db12d7020629f26f30d287558a8d0e32b41
Notify-Also : DRI <dri-devel@lists.sourceforge.net>
Comment 28 Jesse Barnes 2009-04-27 16:52:36 UTC
This is the only case I've heard of this problem, and this:
(==) intel(0): VideoRam: 16124 KB
is a bit worrying.  I'd expect problems with a videoram setting this low.  Are you forcing it by hand?  Can you attach your xorg.conf?

Note that forcing it isn't generally necessary.  Physical RAM will only be used when the server actually binds the memory.  The "videoram" line in the log just reports the aperture size (which is the maximum amount the X server is allowed to use).  The memory manager in recent kernels assumes it has the full aperture to play with; if you want to limit its memory usage we'd need to patch it instead.
Comment 29 Jesse Barnes 2009-04-27 17:39:29 UTC
Oh never mind; looking at the report, I see that your AGP init is failing, so the driver is just trying to use stolen memory (which explains the 16M videoram).

I guess we just need to figure out why your AGP driver is returning EBUSY when the driver tries to initialize it...  Dave?

Also, does this still happen with 2.6.30-rc?
Comment 30 Emil Karlson 2009-04-30 17:27:00 UTC
Bug is still present in linux-2.6.30-rc4.
Comment 31 Emil Karlson 2009-05-17 16:31:40 UTC
Still present in 2.6.30-rc6.

Also, the test hardware is being removed, so you may want to close this bug until someone else reproduces it.
Comment 32 Jesse Barnes 2009-05-17 23:22:33 UTC
Ok, thanks.  Yeah I haven't heard other reports of this; I suspect a config issue or a related bug, several of which have been fixed now.