Mass Effect is a popular franchise of sci-fi roleplaying games. The first game was initially released by BioWare in late 2007, exclusively on Xbox 360 as part of a publishing deal with Microsoft. A few months later, in mid-2008, the game received a PC port developed by Demiurge Studios. It was a decent port with no obvious flaws, at least until 2011, when AMD released its new Bulldozer-based CPUs. When playing the game on PCs with these modern AMD processors, two areas in the game (Noveria and Ilos) show severe graphical artifacts.
[…] What makes this issue particularly interesting? Vendor-specific bugs are nothing new, and games have had them for decades. However, to the best of my knowledge, this is the only case where a graphical issue is caused by a processor and not by a graphics card. In the majority of cases, issues are tied to a specific GPU vendor and don’t depend on the CPU, while in this case it’s the exact opposite. This makes the issue unique and worth looking into.
An extremely detailed look into the analysis and fix for this very specific bug – and a download with the fix, of course.
Well, it’s an interesting article, and props to them for getting it working. I wish they had dug a little further to identify the exact difference between D3DXMatrixInverse and XMMatrixInverse on AMD CPUs. They speculate that it could be a precision issue, but even rounding differences don’t clearly explain why the rendering goes black… Their job is done and they wrote a patch to avoid the faulty branch, but it seems like they got too tired of debugging to solve the mystery of the faulty branch, haha.
Given this is lighting we’re talking about, I’m willing to bet there’s a reciprocal square root estimation being done somewhere that is being fed values close to black (0) that go slightly negative on AMD SSE2. That would produce a NaN when the root is taken. Reciprocal square root estimation has historically been done with an eye toward speed rather than accuracy. Maybe the new code validates that the inputs are all positive before doing the estimate. No one validated inputs originally, since that could be a major performance hit on some systems. If you look at the old Doom code as an example, the column and segment renderers had conditionally compiled checks on their input parameters; those checks are normally disabled in the release binary and enabled only in the debug binary.
JLF65,
Maybe, though it’s still speculative. By far the most obvious place for square roots in game engines is distance computation, where the inputs are squared anyway and thus cannot be negative. Even allowing for numerical approximation, negative numbers would be highly suggestive of a bug somewhere. It’s really weird that Mass Effect triggers this bug with D3DXMatrixInverse on AMD and nothing else (that we know of) does. I’m definitely curious what’s happening, but ultimately it would take more specific input & output debugging to know for sure.
Inverse square roots are extremely common in lighting calculations. It’s also a pretty common classification of floating-point mistake that you do something like this:
float foo(float in)
{
    if (in == 4.0f)
        return 0.0f;
    else
        return 1.0f / ((in / 2.0f) - 2.0f);
}
In which case certain values will still cause a divide-by-zero. You can imagine other cases where it could end up resulting in a slightly negative number, etc.
FlyingJester,
Is that just a random function or is it supposed to represent something specific?
As you probably already know, in practice, using equality comparisons on floating-point variables as that code snippet does is asking for trouble…
https://stackoverflow.com/questions/17404513/floating-point-equality-and-tolerances
The algorithm, compiler, optimization level, SSE usage, etc. can all change the result of a floating-point equality comparison. IMHO, one should never use equality on floating point without taking extreme care and factoring in the underlying architectural implementation.
https://en.wikipedia.org/wiki/Extended_precision
Anyway, obviously I get why divide-by-zero can be a problem, but as it relates to this article, if there were divisions by zero, D3DXMatrixInverse returning NaN would make sense for both AMD and Intel. Hypothetically, maybe a floating-point comparison was used and could be the source of this bug…? Or maybe the AMD code path returns NaN whereas the Intel one returns +/-INF? Alas, we’d be in a better place to answer questions like this if the author had provided more output.
Alfman: It can work with inequality too. Imagine that it’s a sqrt() instead of a reciprocal.
FlyingJester,
Ah, so you meant greater-than and less-than rather than equality. So in effect you’re asking whether these two lines of code are identical…
Well, we can consider the edge cases.
Can you think of any other in-range numbers for x and y such that the lines are not equal? All the in-range numbers I tried worked (i.e. matched) as expected.
I suspect all the values in a real game engine would be finite, and even if it were possible to produce infinities, it isn’t really clear to me why Intel wouldn’t also experience it. I think it’s too convenient to blame precision. The glitch, whatever it is, happens 100% of the time, not just for colors below a certain value or something like that. I really think something else is amiss, but again, without data it’s just speculation.
Another reminder that even though Windows 8.1 with secdrv.sys enabled is theoretically bug-to-bug compatible with Windows 2000 and later, hardware-related incompatibilities do come up, though they often result in frame-pacing issues rather than graphical glitches.