From the trenches: about multicore optimizations in ArmA 2
Let us take a look at how multicore is used in ArmA 2.
AI path finding was already asynchronous in ArmA, with multiple path finding jobs active at the same time and AI units never waiting for the jobs to complete. This made it an attractive target for threading optimization, and it brought significant results very quickly.
With the much improved combat AI in ArmA 2 (AI searching for cover, AI avoiding lines of possible enemy fire), those jobs now get a lot more CPU time than before, with no performance impact.
Picture 1: Pathfinding in the Utes village
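To make the "never waiting" idea concrete, here is a minimal sketch of an AI unit that starts a path finding job on another thread and polls for the result once per simulation step. It is not the ArmA 2 implementation: `Waypoint`, `Path`, `findPath` and `AiUnit` are hypothetical names, and the "search" is a trivial stand-in for a real one.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <vector>

// Hypothetical types for illustration; none of these names come from the game's code.
struct Waypoint { int x; int y; };
using Path = std::vector<Waypoint>;

// Stand-in for an expensive search: walks along x first, then along y.
Path findPath(Waypoint from, Waypoint to) {
    Path path{from};
    Waypoint cur = from;
    while (cur.x != to.x) { cur.x += (to.x > cur.x) ? 1 : -1; path.push_back(cur); }
    while (cur.y != to.y) { cur.y += (to.y > cur.y) ? 1 : -1; path.push_back(cur); }
    return path;
}

class AiUnit {
public:
    // Starts a path finding job on another thread and returns immediately.
    void requestPath(Waypoint from, Waypoint to) {
        // Assumption of this sketch: at most one job in flight per unit.
        assert(!pending_.valid());
        pending_ = std::async(std::launch::async, findPath, from, to);
    }

    // Called once per simulation step. Never blocks: if the job is not finished,
    // the unit keeps whatever path it had and checks again on the next step.
    bool update() {
        if (pending_.valid() &&
            pending_.wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
            path_ = pending_.get();
        }
        return !path_.empty();
    }

    const Path& path() const { return path_; }

private:
    std::future<Path> pending_;
    Path path_;
};
```

Because the unit only ever asks "is it ready yet?", the job can take as long as it needs on another core without stalling the simulation step that polls it.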
Rendering is still the part of the game which takes most of the CPU time in ArmA, more than simulation or AI. To significantly optimize it, we have evaluated two possible approaches:
- Separate rendering thread
- Parallelism of selected parts of rendering
The first approach is often assumed to be the best way to handle rendering. There is one thing which makes this approach attractive:
- the data flow is one-directional: no data flows back from rendering into the simulation, so implementing it is relatively simple
However, there are downsides as well:
- there can be huge amounts of data passing from simulation to rendering. In a single-core solution you usually share the data. Once simulation and rendering become independent, they need to pass this data somehow, and the overhead of passing it between threads can be significant, as it usually requires a lot of copying (memory traffic)
- this architecture increases latency (response lag): it does nothing to shorten the time between the simulation step and the moment the resulting image is displayed
- it does not scale to more cores
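The trade-offs listed above can be seen in a small sketch of a separate rendering thread. This is only an illustration under stated assumptions, not code from the game: `FrameSnapshot`, `FrameMailbox` and `runPipeline` are hypothetical names, and the hand-over is a simple one-slot mailbox guarded by a mutex.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical frame data; the names are illustrative only.
struct FrameSnapshot { int frameId; std::vector<float> positions; };

// One-slot mailbox: the simulation publishes a copy, the render thread takes it.
class FrameMailbox {
public:
    void publish(const FrameSnapshot& frame) {
        std::unique_lock<std::mutex> lock(m_);
        slotFree_.wait(lock, [this] { return !full_; });
        slot_ = frame;  // full copy: this is the memory traffic cost
        full_ = true;
        slotFilled_.notify_one();
    }

    FrameSnapshot take() {
        std::unique_lock<std::mutex> lock(m_);
        slotFilled_.wait(lock, [this] { return full_; });
        FrameSnapshot out = std::move(slot_);
        full_ = false;
        slotFree_.notify_one();
        return out;
    }

private:
    std::mutex m_;
    std::condition_variable slotFree_, slotFilled_;
    FrameSnapshot slot_{};
    bool full_ = false;
};

// Runs `frames` simulation steps while a separate thread "renders" them.
// Returns the frame ids in the order the render thread saw them.
std::vector<int> runPipeline(int frames) {
    assert(frames >= 0);
    FrameMailbox mailbox;
    std::vector<int> rendered;
    std::thread renderThread([&] {
        for (int i = 0; i < frames; ++i) rendered.push_back(mailbox.take().frameId);
    });
    FrameSnapshot world{0, std::vector<float>(1024, 0.0f)};
    for (int i = 0; i < frames; ++i) {
        world.frameId = i;       // stand-in for simulating frame i
        mailbox.publish(world);  // data flows one way only, as a copy
    }
    renderThread.join();
    return rendered;
}
```

The sketch shows the appeal (nothing flows back, so the synchronization is small) and the cost (every frame is copied). It also uses exactly two threads, however many cores the machine has, and rendering of a frame only starts after that frame has been simulated and handed over.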
What we eventually ended up with is a hybrid approach, which uses a worker thread for rendering, but some important inner loops are made concurrent across multiple cores.
Picture 2: ArmA 2 threading displayed using a custom in-game debugging tool, captured on a quad-core machine
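A rough sketch of the hybrid idea, assuming a visibility test as the "important inner loop": rendering runs on one worker thread, and inside it the hot loop is split into chunks that run across the available cores. All names here (`RenderObject`, `cullParallel`, `renderFrame`) are hypothetical, and a real engine would keep persistent threads rather than spawning new ones every frame as this sketch does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

// Hypothetical per-object render data; the article does not describe the real structures.
struct RenderObject { float distance; bool visible; };

// Inner loop made concurrent: split the range into chunks, one task per chunk.
// Each task writes only to its own objects, so no locking is needed.
void cullParallel(std::vector<RenderObject>& objects, float viewDistance, unsigned workers) {
    assert(workers > 0);
    const std::size_t chunk = (objects.size() + workers - 1) / workers;
    std::vector<std::future<void>> tasks;
    for (std::size_t begin = 0; begin < objects.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, objects.size());
        tasks.push_back(std::async(std::launch::async, [&objects, viewDistance, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                objects[i].visible = objects[i].distance <= viewDistance;
        }));
    }
    for (auto& t : tasks) t.get();  // join before the results are used
}

// Rendering runs on one worker thread; inside it, the hot loop fans out across cores.
std::size_t renderFrame(std::vector<RenderObject>& objects, float viewDistance) {
    std::size_t drawn = 0;
    std::thread renderThread([&] {
        const unsigned cores = std::max(1u, std::thread::hardware_concurrency());
        cullParallel(objects, viewDistance, cores);
        for (const auto& o : objects) drawn += o.visible ? 1 : 0;  // stand-in for draw calls
    });
    // The main thread would be free to run simulation or AI here.
    renderThread.join();
    return drawn;
}
```

Unlike a single dedicated rendering thread, the chunked loop gets more parallelism as the core count grows, which is the scaling property the hybrid approach is after.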
Goal and Means
It is important not to lose sight of the goal, which is increased performance. Everything else is secondary. One example of a misleading metric is the concurrency level, which tells us how heavily the additional cores are used. It is very easy to measure (you can do it in the default system task manager), and that is probably why many hardcore end users and reviewers are interested in it. You often see phrases like "Game XXXX is using quad cores very well, because when you watch CPU usage in task manager, you see all cores are running at 100 %". It is very easy to create a trivial program which makes "full use of all cores": all you need to do is spawn a few threads and make them spin in an infinite loop. Concurrency is not a goal, only a means. It is necessary, but not sufficient. Real-life scenarios are more intricate than idle loops, but the principle is the same: using the CPU does not mean you get any benefit from using it. In many cases the overhead of going "threaded" is so high that even when two cores are running at 100 %, the performance improvement is very small, say about 20 % over a single core, and the difference between quad and dual is even smaller.
ArmA 2 gets the following improvements from running on a dual/multi-core machine:
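One simple way to put numbers on this is Amdahl's law. This is a model I am bringing in for illustration, not something measured in the game: it treats all lost time as a part of the frame that cannot run concurrently, and it ignores threading overhead as a separate cost. The function names are hypothetical.

```cpp
#include <cassert>

// Amdahl's law: speedup on `cores` cores when only `parallelFraction` of the
// frame time can run concurrently. Threading overhead is not modelled separately.
double amdahlSpeedup(double parallelFraction, unsigned cores) {
    assert(cores >= 1 && parallelFraction >= 0.0 && parallelFraction <= 1.0);
    return 1.0 / ((1.0 - parallelFraction) + parallelFraction / cores);
}

// Inverse for the two-core case: which parallel fraction explains a measured
// dual-core speedup? Derived from S = 1 / (1 - p/2), so p = 2 * (1 - 1/S).
double parallelFractionFromDualCore(double dualCoreSpeedup) {
    assert(dualCoreSpeedup >= 1.0 && dualCoreSpeedup <= 2.0);
    return 2.0 * (1.0 - 1.0 / dualCoreSpeedup);
}
```

Under this model, a 20 % gain on two cores corresponds to only about a third of the work running concurrently. The same fraction on four cores gives a speedup of about 1.33 over a single core, which is only around 11 % better than dual core, even if the task manager shows every core busy.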
- improved rendering performance
- smarter AI
- larger scenes possible (higher view distance, more objects in view, more AI units) with little performance drop, especially on multi core machines
Herb Sutter: The Free Lunch Is Over - a now classic introduction
Valve goes multicore - how Valve is handling multi core
.mischief.mayhem.soap. - "Intel Thread Profiler" - blog post sharing multicore experiences