A new investigation by David Kanter at RealWorldTech adds to the pile of circumstantial evidence that NVIDIA has apparently crippled the performance of CPUs on its popular, cross-platform physics acceleration library, PhysX. If it's true that PhysX has been hobbled on x86 CPUs, then this move is part of a larger campaign to make the CPU—and Intel in particular—look weak and outdated. The PhysX story is important because, in contrast to the usual sniping over conference papers and marketing claims, the PhysX issue could affect real users.
We talked to NVIDIA today about Kanter's article, and gave the company a chance to air its side of the story. So we'll first take a look at the RWT piece, and then we'll look at NVIDIA's response.
Oh my God, it's full of cruft
When NVIDIA acquired Ageia in 2008, the GPU maker had no intention of getting into the dedicated physics accelerator hardware business. Rather, the game plan was to give the GPU a new, non-graphics, yet still gaming-oriented advantage over the CPU and over ATI's GPUs. NVIDIA did this by ditching Ageia's accelerator add-in board and porting the platform's core physics libraries, called PhysX, to NVIDIA GPUs using CUDA. PhysX is designed to make it easy for developers to add high-quality physics simulation to their games, so that cloth drapes the way it should, balls bounce realistically, and smoke and fragments (mostly from exploding barrels) fly apart in a lifelike manner. In recognition of the fact that game developers, by and large, don't bother to release PC-only titles anymore, NVIDIA also wisely ported PhysX to the leading game consoles, where it runs quite well on console hardware.
If there's no NVIDIA GPU in a gamer's system, PhysX will default to running on the CPU, but it doesn't run very well there. You might think that the CPU's performance deficit is due simply to the fact that GPUs are far superior at physics simulation, and that the CPU's poor showing on PhysX is just more evidence that the GPU is really the component best-equipped to give gamers realism.
Some early investigations into PhysX performance showed that the library uses only a single thread when it runs on a CPU. This is a shocker for two reasons. First, the workload is highly parallelizable, so there's no technical reason for it not to use as many threads as possible; and second, it uses hundreds of threads when it runs on an NVIDIA GPU. So the fact that it runs single-threaded on the CPU is evidence of neglect on NVIDIA's part at the very least, and possibly malign neglect at that.
But the big kicker detailed by Kanter's investigation is that PhysX on a CPU appears to exclusively use x87 floating-point instructions, instead of the newer SSE instructions.
x87 = old and busted
The x87 floating-point math extensions have long been one of the ugliest legacy warts on x86. Stack-based and register-starved, x87 is hard to optimize, and it needs more instructions and memory accesses than comparable RISC hardware to accomplish the same task. Intel finally fixed this issue with the Pentium 4 by introducing a set of SSE scalar, single- and double-precision floating-point instructions that could completely replace x87, giving programmers access to more and larger registers, a flat register file (as opposed to x87's stack structure), and, of course, floating-point vector formats.
Intel formally deprecated x87 in 2005, and every x86 processor from both Intel and AMD has long supported SSE. For the past few years, x87 support has been included in x86 processors solely for backwards compatibility, so that you can still run old, deprecated, unoptimized code on them. Why then, in 2010, does NVIDIA's PhysX emit x87 instructions, and not scalar SSE or, even better, vector SSE?
NVIDIA: the bottlenecks are elsewhere
We spent some time talking through the issue with NVIDIA's Ashutosh Rege, Senior Director of Content and Technology, and Mike Skolones, Product Manager for PhysX. The gist of the pair's argument is that PhysX games are typically written for a console first (usually the PS3), and then they're ported to the PC. And when the games go from console to PC, the PC runs them faster and better than the console without much, if any, optimization effort.
"It's fair to say we've got more room to improve on the CPU. But it's not fair to say, in the words of that article, that we're intentionally hobbling the CPU," Skolones told Ars. "The game content runs better on a PC than it does on a console, and that has been good enough."
NVIDIA told us that it has never really been asked by game developers to spend any effort making the math-intensive parts of PhysX faster—when it gets asked for optimization help, it's typically for data structures or in an area that's bandwidth- or memory-bound.
"Most of the developer feedback to us is all around console issues, and as you can imagine that's the number one consideration for a lot of developers," Skolones said.
Even after they made their case, we were fairly direct in asking them why they couldn't do a simple recompile and use SSE instead of x87. It's not that hard, so why keep on with the old, old, horrible x87 cruft?
The answer to this was twofold. First, and least important, is the fact that all the AAA developers have access to the PhysX source and can (and do) compile it on their own—so a studio that wanted SSE output could simply recompile for it.
The second answer was more important, and it surprised me a bit: the PhysX 2.x code base is so ancient (it goes back to well before 2005, when x87 was deprecated), and it has such major problems elsewhere, that NVIDIA insisted a switch to SSE would be essentially pointless.
When you're talking about changing a compiler flag, something that could have been done in any point release, the combination of "nobody ever asked for it, and it wouldn't help real games anyway because the bottlenecks are elsewhere" is not quite convincing. It never occurred to anyone over the past five years to make this tiny change to SSE? Really?
Of all the answers we got for why PhysX still uses x87, the most convincing ones were the ones rooted in game developer apathy towards the PC as a platform. Rege ultimately summed it up by arguing that if they weren't giving developers what they wanted, then devs would quit using PhysX; so they do give them what they want, and what they want are console optimizations. What nobody seems to care about are PC optimizations like non-crufty floating-point and (even better) vectorization.
"It's a creaky old codebase, there's no denying it," Skolones told Ars. "That's why we're eager to improve it with 3.0."
Wait for 3.0
It's rare that you talk to people at a company and they spend as much time slagging their codebase as the NVIDIA guys did on the PC version of PhysX. It seemed pretty clear that PhysX 2.x has a ton of legacy issues, and that the big ground-up rewrite that's coming next year with 3.0 will make a big difference. The 3.0 release will use SSE scalar at the very least, and they may do some vectorization if they can devote the engineering resources to it.
As for how big of a difference 3.0 would bring for PhysX on the PC, we and NVIDIA had divergent takes. Rege expressed real skepticism that a combination of greater multithreading, SSE scalar, and vectorization would yield a 2X performance boost for the CPU on the specific kernels that Kanter tested. We don't know those kernels very well, but our intuition tells us that a 2X boost shouldn't be unreasonable. Intel and AMD have spent a lot of effort in the past few years, both on the hardware side and in their compilers, making SSE execute very quickly—and none at all on x87.
Even if there was a 2X speedup to be had on a set of test kernels, that wouldn't translate to a 2X speedup in game performance. Individual frames vary greatly in how much physics processing is going on, depending on what's happening in the scene. So it's very hard to say what kind of average speedup an all-out optimization effort would deliver, which makes it even harder to speculate about 3.0.
It's also the case that the PC is the least sexy gaming platform that PhysX supports. When the list of PhysX platforms includes the iPhone and all the consoles, it's easy to imagine that both developers and NVIDIA itself spend the majority of their effort elsewhere.
But still, when you boil it all down, we keep coming back to the point that it's so easy to switch from x87 to SSE, and x87 has been deprecated for so long, and it's so much to NVIDIA's advantage to be able to tout the GPU's superiority over the CPU for physics, that it's very hard to shake the feeling that there's some kind of malicious neglect going on. Think about it: if game developers really don't care that much about PC physics performance, and it's "good enough" with x87 code, why make a small change that might give the CPU an unneeded boost? Why not just let it struggle along at "good enough"?
Source: https://arstechnica.com/gaming/2010/07/did-nvidia-cripple-its-cpu-gaming-physics-library-to-spite-intel/