Bug 8633 (Flaky-EDID-Fetch) - Fetch of EDID 128 byte buffer by X server through vm86 INT 10 call is flaky.
Summary: Fetch of EDID 128 byte buffer by X server through vm86 INT 10 call is flaky.
Status: CLOSED PATCH_ALREADY_AVAILABLE
Alias: Flaky-EDID-Fetch
Product: Platform Specific/Hardware
Classification: Unclassified
Component: i386
Hardware: All Linux
Importance: P1 normal
Assignee: other_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-06-15 12:30 UTC by William Cattey
Modified: 2007-11-08 05:52 UTC

See Also:
Kernel Version: 2.6.18-8 and 2.6.20 have been tested.
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Patch to disable call to audit_syscall_exit (538 bytes, patch)
2007-07-10 16:09 UTC, William Cattey

Description William Cattey 2007-06-15 12:30:58 UTC
Most recent kernel where this bug did not occur: 2.6.9

Distribution: Red Hat Enterprise 5, and Ubuntu 7.04.

Hardware Environment:  Dell Optiplex 745 with ATI Radeon X1300 pro.

Software Environment: Stock RHEL 5, Stock Ubuntu 7.04.

Problem Description: The symptom is that the X server does not properly configure the video monitor.  This sort of thing gets reported to all kinds of people all the time.  This time it appears to be a kernel issue.

The EDID transfer that fetches the 128 bytes describing the monitor is sometimes successful, sometimes all zeros, and sometimes a partial transfer in which the rest of the buffer (preset to 0x13 so that writes can be detected) is overwritten with zeros.
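For anyone trying to reproduce this, the three outcomes can be told apart mechanically. The following is only a minimal sketch, not code from the X server: `classify_buffer` is a hypothetical helper, and the 16-byte zero-tail threshold is an assumption chosen for illustration.

```python
EDID_LEN = 128
SENTINEL = 0x13        # fill value the buffer was preset to, per the report
MIN_ZERO_TAIL = 16     # assumption: this many trailing zeros suggests truncation


def classify_buffer(buf: bytes) -> str:
    """Hypothetical helper: label a 128-byte EDID buffer after the INT 10 call."""
    if len(buf) != EDID_LEN:
        raise ValueError(f"expected {EDID_LEN} bytes, got {len(buf)}")
    if buf == bytes([SENTINEL]) * EDID_LEN:
        return "untouched"    # nothing was written over the preset fill
    if buf == bytes(EDID_LEN):
        return "all zeros"    # whole buffer overwritten with zeros
    # Count the zero bytes at the end of the buffer.
    zero_tail = EDID_LEN - len(buf.rstrip(b"\x00"))
    if zero_tail >= MIN_ZERO_TAIL:
        return "partial"      # some data arrived, the rest is zeros
    return "filled"           # data runs (nearly) to the end of the buffer
```

A buffer reported as "filled" still needs the usual EDID header and checksum checks before it can be trusted.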

Using get-edid from the read-edid package is always successful.  :-(

At first I thought this was due to something introduced in the newer version of the X server. However, I managed to run the X server from RHEL 4, which never failed under its own 2.6.9 kernel, under RHEL 5 with its 2.6.18 kernel. The log files of successive runs show an EDID transfer that is either all zeros or partially successful and padded out with zeros, even with the old happy X server under the new kernel.

The EDID transfer is initiated by the stock VESA X server, rather than any proprietary module.  It appears to be a race condition, or a caching issue.

I have bugs open with Ubuntu and Red Hat, but inasmuch as this bug seems to go across versions, I wanted to ask for help here.

I apologize for not properly classifying this bug initially, and for coming to you without having done as much as one should do before coming to xorg. I learned how to build a debugging X server for the first time to get it this far.  If some kind soul will give me a pointer to step-by-step instructions, I'll do more to isolate, identify, and rectify this problem myself.

Steps to reproduce:

On any hardware with ATI Radeon X1300 or later video hardware, with the default VESA X server and the kernel 2.6.18 or later, examine the Xorg.0.log output for the section that performs the VESA VBE DDC transfer.

After the lines:
    (II) VESA(0): VESA VBE DDC Level 2
    (II) VESA(0): VESA VBE DDC transfer in appr. 1 sec.
    (II) VESA(0): VESA VBE DDC read successfully
If the next line is:
    (II) VESA(0): Searching for matching VESA mode(s):

Then the transfer was all zeros and ignored without even complaining of bad data.  (I've submitted a patch upstream about that.)

If the transfer was partially or wholly successful you will eventually see something like:

(II) VESA(0): EDID (in hex):
(II) VESA(0):   00ffffffffffff0010ac15a055483137
(II) VESA(0):   1710010368261e78eeee91a3544c9926
(II) VESA(0):   0f5054a54b00714f8180010101010101
(II) VESA(0):   010101010101302a0098510000000000
(II) VESA(0):   00000000000000000000000000000000
(II) VESA(0):   00000000000000000000000000000000
(II) VESA(0):   00000000000000000000000000000000
(II) VESA(0):   00000000000000000000000000000000

The above indicates a partial transfer.
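A dump like this can be checked without reading the hex by eye. The sketch below is an illustration only: `edid_from_log` and `check_edid` are hypothetical helper names, and the regular expression assumes the log line layout shown above. It relies on two properties of the EDID base block: the fixed 8-byte header `00 ff ff ff ff ff ff 00`, and a final byte chosen so that all 128 bytes sum to 0 modulo 256.

```python
import re

# Matches log lines that carry 16 bytes (32 hex digits) of EDID data.
HEX_LINE = re.compile(r"\(II\) VESA\(0\):\s+([0-9a-fA-F]{32})\s*$")
EDID_HEADER = bytes.fromhex("00ffffffffffff00")

# The partial transfer quoted in this report.
SAMPLE_LOG = """\
(II) VESA(0): EDID (in hex):
(II) VESA(0):   00ffffffffffff0010ac15a055483137
(II) VESA(0):   1710010368261e78eeee91a3544c9926
(II) VESA(0):   0f5054a54b00714f8180010101010101
(II) VESA(0):   010101010101302a0098510000000000
(II) VESA(0):   00000000000000000000000000000000
(II) VESA(0):   00000000000000000000000000000000
(II) VESA(0):   00000000000000000000000000000000
(II) VESA(0):   00000000000000000000000000000000
"""


def edid_from_log(lines):
    """Collect the hex dump lines from Xorg.0.log text into raw bytes."""
    chunks = [m.group(1) for line in lines if (m := HEX_LINE.search(line))]
    return bytes.fromhex("".join(chunks))


def check_edid(edid: bytes) -> dict:
    """Report basic sanity checks on a 128-byte EDID base block."""
    return {
        "length_ok": len(edid) == 128,
        "header_ok": edid[:8] == EDID_HEADER,
        "checksum_ok": len(edid) == 128 and sum(edid) % 256 == 0,
        "zero_tail": len(edid) - len(edid.rstrip(b"\x00")),
    }


if __name__ == "__main__":
    print(check_edid(edid_from_log(SAMPLE_LOG.splitlines())))
```

For the dump above, the header is intact but the last 69 bytes are zero, which is what a transfer cut off partway through looks like.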

Successive runs of the X server will produce different contents for the DDC transfer.

The downside to this bug is that it's a nasty race condition in a part of the kernel that people don't like to debug -- fetching BIOS data through the vm86 stuff.  Most people expect it simply does not work and advocate use of the vendor proprietary drivers, or avoiding the hardware completely until a reverse-engineered driver comes out.

Is there someone out there who knows how this stuff is supposed to work who can give me some instruction on how to cut this thing apart and find out what's really going on?

Thanks in advance,

-Bill Cattey
Former Team Leader MIT Athena development team.
Now Linux Platform Coordinator
Massachusetts Institute of Technology.

(I'm supposed to be the manager and not get my hands dirty with the technical stuff, but it's killing the hardware we want to buy, and nobody else seems to be willing to dig into it.)
Comment 1 William Cattey 2007-07-10 16:09:10 UTC
Created attachment 11995
Patch to disable call to audit_syscall_exit

A friend helped me learn how to build a kernel.  He also bench-checked the vm86.c code, and suggested that perhaps the registers required for reliable int10 emulation were being trashed by the call to audit_syscall_exit.

The attached patch disables the call to audit_syscall_exit.

With this patch applied, error messages complaining about freeing multiple contexts return, but the EDID data transfers are once again 100% reliable from inside the X server.

If someone familiar with the change to call audit_syscall_exit could re-examine that change, and supply an amended vm86.c, we'd be happy to test it.  We didn't presume to understand things well enough to propose a better placement of the call.
Comment 2 Natalie Protasevich 2007-08-26 00:27:32 UTC
Can you confirm the problem is still in 2.6.22+ and, if so, attach your boot log (dmesg) and dmidecode output if possible?
If this problem wasn't happening with 2.6.9, you might have to try git bisect to find changes that caused it.
Thanks.
Comment 3 Natalie Protasevich 2007-11-07 23:38:41 UTC
Bill, I think I found someone to look into the vm86 code. I suppose the problem is still there with the latest kernel?
Comment 4 William Cattey 2007-11-08 05:33:46 UTC
Thanks very much.
We've been working this bug with Red Hat at:

https://bugzilla.redhat.com/show_bug.cgi?id=254024

The update since 8/26 is that the 2.6.21 kernel does NOT have the problem.
Our current operating theory is that a change introduced at 2.6.20 remedied the problem:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=49d26b6eaa8e970c8cf6e299e6ccba2474191bf5

It looks like kernel.org need not worry about this bug.
Now it's up to Red Hat to decide whether a backport of the changes in 2.6.20 or later to their Enterprise kernel (2.6.18 baseline) is possible, feasible, and sensible.

I apologize for not updating this bug with the relevant status.
Comment 5 Natalie Protasevich 2007-11-08 05:52:52 UTC
Great, thanks for the update!
