Bug 14535

Summary: Memory corruption detected in low memory
Product: Drivers Reporter: erki85
Component: Video(DRI - non Intel) Assignee: drivers_video-dri
Status: CLOSED CODE_FIX    
Severity: high CC: airlied, alan, alexdeucher, auxsvr, gary.pajer, hmh, jcristau, matifamin, razamatan, secsaba, shy, torsten
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.31-5 Subsystem:
Regression: No Bisected commit-id:
Attachments: kernel patch to block bad behaviour

Description erki85 2009-11-03 04:17:14 UTC
Any game that uses hardware acceleration fails to run. Instead, the filesystem gets corrupted and almost no application is accessible (for example, ls reports that there are no files in /usr/bin).
After restart and fsck things go back to normal.

In /var/log/kernel.log there are lots of Corrupted low memory at c0004efc (4efc phys) = 00ffffff messages and after these

Nov  3 05:49:01 arch kernel: ------------[ cut here ]------------
Nov  3 05:49:01 arch kernel: WARNING: at arch/x86/kernel/check.c:134 check_for_bios_corruption+0xd2/0xe0()
Nov  3 05:49:01 arch kernel: Hardware name: 23748TG
Nov  3 05:49:01 arch kernel: Memory corruption detected in low memory
Nov  3 05:49:01 arch kernel: Modules linked in: ipv6 ext2 fan snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device joydev snd_pcm_oss radeon ttm snd_intel8x0 snd_ac97_codec ac97_bus thinkpad_acpi pcmcia drm i2c_algo_bit snd_mixer_oss iTCO_wdt iTCO_vendor_support rfkill led_class nvram uhci_hcd ehci_hcd ppdev video output snd_pcm e1000 ipw2200 parport_pc battery psmouse sr_mod cdrom nsc_ircc snd_timer i2c_i801 snd soundcore floppy ac usbcore intel_agp agpgart libipw lib80211 snd_page_alloc i2c_core thermal serio_raw button yenta_socket rsrc_nonstatic pcmcia_core irtty_sir sir_dev sg evdev irda crc_ccitt shpchp pci_hotplug lp parport cpufreq_ondemand acpi_cpufreq freq_table processor rtc_cmos rtc_core rtc_lib ext4 mbcache jbd2 crc16 sd_mod ata_piix ata_generic pata_acpi libata scsi_mod
Nov  3 05:49:01 arch kernel: Pid: 6, comm: events/0 Tainted: G        W  2.6.31-ARCH #1
Nov  3 05:49:01 arch kernel: Call Trace:
Nov  3 05:49:01 arch kernel: [<c104654a>] ? warn_slowpath_common+0x7a/0xc0
Nov  3 05:49:01 arch kernel: [<c102a092>] ? check_for_bios_corruption+0xd2/0xe0
Nov  3 05:49:01 arch kernel: [<c1046607>] ? warn_slowpath_fmt+0x37/0x60
Nov  3 05:49:01 arch kernel: [<c102a092>] ? check_for_bios_corruption+0xd2/0xe0
Nov  3 05:49:01 arch kernel: [<c102a0a0>] ? check_corruption+0x0/0x50
Nov  3 05:49:01 arch kernel: [<c102a0b3>] ? check_corruption+0x13/0x50
Nov  3 05:49:01 arch kernel: [<c105d03f>] ? worker_thread+0x11f/0x280
Nov  3 05:49:01 arch kernel: [<c1062de0>] ? autoremove_wake_function+0x0/0x60
Nov  3 05:49:01 arch kernel: [<c105cf20>] ? worker_thread+0x0/0x280
Nov  3 05:49:01 arch kernel: [<c106298c>] ? kthread+0x8c/0xa0
Nov  3 05:49:01 arch kernel: [<c1062900>] ? kthread+0x0/0xa0
Nov  3 05:49:01 arch kernel: [<c10048e7>] ? kernel_thread_helper+0x7/0x10
Nov  3 05:49:01 arch kernel: ---[ end trace a7919e7f17c0a727 ]---


system: Arch Linux, Thinkpad T42 with Radeon 7500
Xorg: 1.6.3.901
xf86-video-ati: 6.12.4
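
For anyone collecting data on this, the "Corrupted low memory at ..." lines quoted above could be tallied with a short script. This is only a sketch: the regular expression is written against the single message format shown in this report, the function names are made up for illustration, and the timestamp prefix in the sample line is an assumption modelled on the other log lines.

```python
import re
from collections import Counter

# Matches the message format quoted in the description, e.g.
#   Corrupted low memory at c0004efc (4efc phys) = 00ffffff
CORRUPTION_RE = re.compile(
    r"Corrupted low memory at (?P<virt>[0-9a-f]+) "
    r"\((?P<phys>[0-9a-f]+) phys\) = (?P<value>[0-9a-f]+)"
)


def parse_corruption_lines(lines):
    """Return (physical_address, value) integer pairs for each matching line."""
    hits = []
    for line in lines:
        match = CORRUPTION_RE.search(line)
        if match:
            hits.append((int(match.group("phys"), 16),
                         int(match.group("value"), 16)))
    return hits


def summarize(hits):
    """Report how many hits there were, the address range, and value counts."""
    if not hits:
        return None
    addresses = [addr for addr, _ in hits]
    return {
        "count": len(hits),
        "lowest_phys": min(addresses),
        "highest_phys": max(addresses),
        "values": Counter(value for _, value in hits),
    }


if __name__ == "__main__":
    # Sample input; the syslog prefix here is assumed, not copied from a log.
    sample = [
        "Nov  3 05:49:01 arch kernel: Corrupted low memory at c0004efc (4efc phys) = 00ffffff",
        "Nov  3 05:49:01 arch kernel: Memory corruption detected in low memory",
    ]
    print(summarize(parse_corruption_lines(sample)))
```

Feeding it the lines of /var/log/kernel.log would show whether the corrupted addresses cluster in one region and whether the written value is always the same.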
Comment 1 erki85 2009-11-03 05:23:19 UTC
Same problems with newer xorg and xf86-video-ati.
With KMS enabled no such things happen, but everything is red in games and even glxgears.
Comment 2 Henrique de Moraes Holschuh 2010-01-03 16:22:53 UTC
We have multiple reports of filesystem corruption on DRI radeon.  Kernels 2.6.30 and later with mesa 7.6 (gallium disabled) cause the issue.  Kernel 2.6.30 with mesa 2.5 doesn't cause the issue.

It is likely this bug.

What is the status of this issue?  It causes filesystem corruption and data loss, so it is a very nasty one.

Related (gives more data about the bug, and the reporter is a potential tester for fixes):
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=550977
Comment 3 Torsten Landschoff 2010-01-09 23:03:45 UTC
I still have this problem as well, but I originally ran into this with a 2.6.24 kernel (with OpenVZ support). I first blamed it on my old + patched kernel, but upgrading the kernel did not resolve the problems.

So I'd expect it to be either a bug in Mesa or some kernel bug only exposed with latest Mesa changes.
Comment 4 Alex Deucher 2010-01-11 14:36:33 UTC
Can you try mesa git master or the 7.7 branch?  Dave fixed some potential issues there:
http://cgit.freedesktop.org/mesa/mesa/commit/?id=554043bff72ced41b2a5e03e61cbc087bb41bd3d
http://cgit.freedesktop.org/mesa/mesa/commit/?id=42f2880ffd0b847df7cb56b7f7f0747287e0b08f
Comment 5 Gary 2010-01-17 23:51:15 UTC
I tried mesa 7.7 on Kubuntu 9.10, kernel 2.6.31-17, and the bug persists there.
See details in Ubuntu bug report #474928: https://bugs.launchpad.net/ubuntu/+source/mesa/+bug/474928?comments=all
Comment 6 Dave Airlie 2010-01-18 00:29:58 UTC
did you confirm the patches were in that build?
Comment 7 Gary 2010-01-18 00:59:09 UTC
No.  I didn't know to do that.  How can I check?

In the meantime, I also tried mesa 7.6.0 (default Ubuntu 9.10 package) with kernel 2.6.32-02063203-generic.  No change.
Comment 8 erki85 2010-01-18 14:29:19 UTC
Tried mesa-git (+ati-dri-git and libgl-git) with xf86-video-ati-git today. Xorg 1.7.3.902.

With KMS enabled, glxgears's gears are red and green again (no blue) and running etracer just produces blank screen. Although I can restart X and continue normal working.

With KMS disabled, glxgears does not display error message about radeon_tcl.c anymore, but running etracer locks up the system. I can sort of see the first screen, which has "Press any key to continue", but the rest that should be there is garbled. When I press any key X stops and I cannot close it. REISUB works though this time (didn't before).

When I ran HOMM4 in wine, I got the usual Memory corruption detected in low memory thingie. Couldn't restart X, couldn't REISUB.
Comment 9 Shyouzou Sugitani 2010-01-19 01:26:24 UTC
I tried mesa 7.7-1 (from Debian experimental) on Debian sid with kernel 2.6.32-trunk-686.

With RV200(uses R100 microcode) glxgears and etracer works fine.(No system lock up.)
But with RV280(uses R200 microcode) glxgears locks up the system.

I also tried mesa 7.6.1-1(from Debian sid).
With both RV200 and RV280 glxgears locks up the system.
Comment 10 Julien Cristau 2010-01-19 01:32:55 UTC
> --- Comment #9 from Shyouzou Sugitani <shy@m3.catvmics.ne.jp>  2010-01-19 01:26:24 ---
> I tried mesa 7.7-1(from Debian experimental) on Debian sid with kernel
> 2.6.32-trunk-686.

fwiw these mesa packages are built from mesa_7_7_branch commit 6d6c9c66.
Comment 11 erki85 2010-01-19 01:38:38 UTC
Tried 2.6.33-rc4 with same software as in my previous comment and now everything seems to be a bit better.

KMS works, glxgears has normal colours again, couldn't test wine+HOMM though because it didn't start working. Hitman2 worked though. Etracer started normally, but when clicking Play button, it exited with "drmRadeonCmdBuffer: -12. Kernel failed to parse or rejected command stream. See dmesg for more info.", dmesg says "[drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation !".

But haven't yet got memory or fs corruption with that setup.
Comment 12 erki85 2010-01-19 01:53:26 UTC
Although with KMS disabled etracer first screen still doesn't look normal and after keypress system hangs (REISUB out, no corruption).
Comment 13 erki85 2010-01-19 02:19:30 UTC
Actually never mind my last two comments. Turns out that the difference was in early and late start KMS. So it seems like kernel upgrade did not change anything either.
Comment 14 Gary 2010-01-19 15:42:27 UTC
Has anyone had success by reducing color depth to 16?  I just did that, and I got rid of (when running glxgears):

*********************************WARN_ONCE*********************************
File radeon_tcl.c function radeon_run_tcl_render line 499
Rendering was 405 commands larger than predicted size. We might overflow  command buffer.
***************************************************************************

Honestly, I'm afraid to run googleearth.  I've fsck'ed the system so many times ... I'm worried that my nine lives may have run out.  :)
Comment 15 Gary 2010-01-19 16:01:48 UTC
Additional Comment to #14:
Here's the relevant part of my xorg.conf

Section "Monitor"
        Identifier   "Monitor0"
        VendorName   "Monitor Vendor"
        ModelName    "Monitor Model" 
EndSection                           

Section "Device"
        Identifier  "Card0"                                          
        Driver      "radeon"                                         
        VendorName  "ATI Technologies Inc"                           
        BoardName   "Radeon Mobility M7 LW [Radeon Mobility 7500]"   
        BusID       "PCI:1:0:0"                                      
        Option          "AccelMethod"           "EXA"                
        Option          "DRI"                   "true"               
EndSection                                                           


Section "Screen"
        Identifier      "Screen0"
        Monitor         "Monitor0"
        Device          "Card0"   

        Defaultdepth    16
        Virtual         1024 768
EndSection
Comment 16 Simon Csaba Endre 2010-02-20 11:38:47 UTC
Same corrupted filesystem problem with Radeon Mobility 7500 on a Thinkpad T42 laptop:
kernel - 2.6.32
mesa - 7.7
xf86-video-ati - 6.12.4
ati-dri - 7.7

bpp 16 will cause a very dim display but with a fully bright cursor and still the problem is there.
bpp 15 the problem is gone but video players displays only a green window.

IMHO this is a SEVERE bug and a big regression if a user-space graphical application overwriting memory can cause filesystem corruption and bring the whole system to its knees.
Comment 17 Gary 2010-02-21 18:34:23 UTC
The Severity of this bug should be changed from Normal to at least High, as it causes both system crashes and data loss.  Perhaps it should be changed to Blocking, as it is impossible to do any 3D graphics development on affected systems. I develop apps that use 3D graphics, so I am blocked.  AFAICT, someone with authority must do this.

After one month of tip-toeing around in my system, taking care not to use anything that uses 3D, yesterday I "accidentally" created a 3D plot; this bug corrupted my file system, the computer crashed, I had to run fsck from a liveCD. I am a scientist, and I need 3D capabilities.  I'm considering a switch to BSD, but that sounds painful.
Comment 18 Henrique de Moraes Holschuh 2010-02-21 18:52:50 UTC
Yes, it should have a higher priority.  But setting it would be pointless, since it is not even assigned to anyone: it would not make anyone move faster to fix the problem.

I would raise it anyway, but I don't have enough 'bugzilla powers' to do it.

What you can do is to file bugs in the *distros* for which there are not any bugs about this yet, and mark *those* as critical (causes filesystem corruption) and blocking.  That might save someone's data.
Comment 19 Alan 2010-02-21 19:46:39 UTC
It's assigned to the non Intel DRI list, which means ATI should be picking up on it. Unfortunately bugzilla has no magical powers to make them care or notice it. Your distro however may well do, or may well employ folks working on this..
Comment 20 Dave Airlie 2010-02-21 19:55:41 UTC
I've tried to reproduce this locally and haven't had any luck using Fedora running in user mode setting. It would help if someone could try with an old kernel and see it still happens.

Alternatively can someone test this with

git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6.git drm-radeon-testing

We haven't touched the non-KMS radeon driver much. The only thing that seems to be happening is that mesa is getting better at creating fully used command buffers while the kernel has gotten worse at giving out large kmallocs (64k). There are fixes for this in drm-radeon-testing that avoid the larger mallocs, which will hopefully help in some way. I'll see if I can spot a codepath where we send crap to the GPU, but we currently test the offsets userspace gives us and validate them to avoid just this.
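
As a rough illustration of what "validating the offsets userspace gives us" means, here is a minimal sketch of a bounds check. It is not the radeon DRM code: `Aperture`, `offset_is_valid` and `check_command_stream` are hypothetical names, and a real command stream carries far more than (offset, length) pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aperture:
    """Hypothetical address window that the GPU is allowed to touch."""
    start: int
    size: int

    @property
    def end(self):
        # Exclusive upper bound of the window.
        return self.start + self.size


def offset_is_valid(aperture, offset, length):
    """True when [offset, offset + length) lies wholly inside the aperture."""
    if offset < 0 or length <= 0:
        return False
    return aperture.start <= offset and offset + length <= aperture.end


def check_command_stream(aperture, commands):
    """Reject the whole stream if any (offset, length) pair is out of range."""
    for index, (offset, length) in enumerate(commands):
        if not offset_is_valid(aperture, offset, length):
            raise ValueError(
                f"command {index}: offset {offset:#x} length {length:#x} "
                "falls outside the permitted aperture"
            )
    return True
```

The point of the sketch is that a check of this kind has to cover both ends of every access; validating only the starting offset would still let a large enough transfer run past the end of the window.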
Comment 21 Julien Cristau 2010-02-22 13:44:21 UTC
(In reply to comment #20)
> I've tried to reproduce this locally and haven't had any luck using Fedora
> running in user mode setting. It would help if someone could try with an old
> kernel and see it still happens.
> 
The user at http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=550562#136 reports kernel BUGs with 2.6.26 and mesa 7.6 (log excerpt at http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=136;filename=kernel.oops;att=1;bug=550562).
I've asked him to try drm-radeon-testing.
Comment 22 Dave Airlie 2010-02-23 00:35:20 UTC
is anyone who is seeing this using page flipping?

I'm still not having any luck here reproducing it on an T42 with mesa 7.6.1 from the branch, 2.6.32.3 kernel
Comment 23 Dave Airlie 2010-02-23 01:32:29 UTC
It would be really nice if the people seeing hangs and crashes that aren't

(a) filesystem corruption
(b) low memory corruption

could not talk about it any more.

Then it would really help if we can get someone running

git version of the kernel (preferably from drm-radeon-testing)
libdrm/mesa from git master
ati ddx from git master

to reproduce either issue a or b with open software (i.e. Windows games aren't something I own). Then tell me what you did and I'll go figure it out.
Comment 24 Gary 2010-02-23 15:13:46 UTC
I'm more than willing to help, but I need some help with implementation and reporting.  And it'll take a couple of days to find time.

Me, summary:  Kubuntu 9.10,  Thinkpad T41, Radeon Mobility 7500. Excerpt from xorg.conf in Comment #15 above. I've tried mesa 7.6.0, 7.7, and 7.5, and kernels 2.6.32, 2.6.31-19, 2.6.31-14.  Mostly from packages that someone else compiled, and not every possible combination.  I get file system corruption almost always.  The exception: Mesa 7.5 appears to be corruption-free, but it introduced something annoying but unrelated to this issue ... can't recall what right now.  I'm considering downgrading to 7.5 again.

1. Page flipping:  I'm using EXA, so I think that means no page flipping.
2. Need an open app?  I've gotten crashes with just about every 3D program I've tried.  For testing, I've been using google earth and vpython (www.vpython.org). I'll find another if you want.
Comment 25 Dave Airlie 2010-02-24 00:50:00 UTC
okay I've reproduced it with Google earth thanks to Gary for pointing it out.

I've tested on drm-radeon-testing and it doesn't happen, so the upstream fixes for the buffer allocation failing must have fixed it. However, that patch is majorly intrusive, so I'll see if I can actually figure out what is going wrong and fix that for stable.
Comment 26 Dave Airlie 2010-02-24 07:23:27 UTC
Created attachment 25180 [details]
kernel patch to block bad behaviour

Block the badness from mesa.
Comment 27 Dave Airlie 2010-02-24 07:24:12 UTC
please test the patch in #26, it should fix it. d-r-t does not fix it after all, contrary to what was previously reported.
Comment 28 Simon Csaba Endre 2010-02-24 07:33:28 UTC
Dave, xscreensaver-demo also causes filesystem corruption. Does this kernel patch fix that as well? Will the patch be applied to the 2.6.32 kernel? As Arch Linux is a rolling distro, I can try it if so.
Comment 29 Dave Airlie 2010-02-24 09:25:42 UTC
mesa fixes are now in the mesa master and mesa 7.7 branches.

Simon in comment #28, yes any GL app that causes low memory corruption should be fixed by this. It'll go in to stable kernel as well once I push it.
Comment 30 Henrique de Moraes Holschuh 2010-02-24 17:16:54 UTC
Ouch.

Dave,

Is there any possibility of userspace-generated content (GL requests, textures, whatever) influencing the bogus DMA details, so that it could be used to, e.g., overwrite specific areas of memory?
Comment 31 Alex Deucher 2010-02-24 17:24:59 UTC
The drm has a command stream checker to keep this from happening, but there can be obscure cases like this one that are not immediately obvious.
Comment 32 Atif Amin 2010-05-09 15:05:50 UTC
give me the drum -s report so that i can investigate internal of hardware structure...

Your DMA controller might be doing some wrong work..
waiting for your Response
Comment 33 razamatan 2011-07-21 08:42:45 UTC
i just noticed this behavior on 2.6.38 and 2.6.39 from my gentoo box.  i have an rs690 chipset, run kms, use the radeon kernel drivers and run compiz on xorg.

b/c i run compiz, all i have to do is be in an X session and do stuff and eventually my root partition (ext3, but moved to ext4 after trying to reformat after a full wipe of the disk... i was worried the disk was going bad) gets corrupted.  i'm completely new to debugging issues at this level, so please advise for what you guys need.
Comment 34 Alan 2012-06-14 16:43:57 UTC
Closing as fixed as the original bug filed here was fixed

razamatan - if your bug is still present please open a new bug