More parallelisation fun: Conway's Game of Life

Okay, so I've probably exhausted the Mandelbrot set as a source of interest, at least for the moment. However, every so often someone mentions something or other which sounds like a fun exercise in parallelisation. The latest one is Conway's Game of Life. Like the Mandelbrot set, this is something I used to play with when I was a teenager - and like the Mandelbrot set, it's an embarrassingly parallel problem.

As before, I've written a few implementations but not worked excessively hard to optimise. All my tests were performed with a game board of 1000x500 cells, displaying at one pixel per cell (partly because that was the easiest way to draw it). I found that with the slower implementations the rendering side was irrelevant to the results, but the rendering became a bottleneck with the fast algorithms, so I've included results both with and without rendering.

The "benchmark" code (I use the word very lightly - normal caveats around methodology apply, this was only a weekend evenings project after all) records the "current" number of frames per second roughly once per second, and also an average over the whole run. (The fast algorithm jump around in fps pretty wildly depending on what's going on - the more naïve ones don't really change much between a busy board and a dead one.) The starting pattern in all cases was the R-pentomino (or F-pentomino, depending on whose terminology you use) in the middle of the board.

A base class (BytePerPixelBoard) is used by all the implementations - this represents the board with a single byte per pixel, which is really easy to work with but obviously very wasteful in terms of memory (and therefore cache). I haven't investigated any other representations, but I wanted to get this post out fairly quickly as I have a lot of other stuff to do at the moment. (I also have a post on Java closures which I should tidy up soon, but hey...)
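
To give an idea of the shape of it, the base class is conceptually something like this (a simplified sketch - the real class has a bit more to it):

    public abstract class BytePerPixelBoard
    {
        // 1 = alive, 0 = dead; a whole byte per cell is wasteful but simple.
        protected byte[] current;
        protected byte[] next;

        public int Width { get; private set; }
        public int Height { get; private set; }

        protected BytePerPixelBoard(int width, int height)
        {
            Width = width;
            Height = height;
            current = new byte[width * height];
            next = new byte[width * height];
        }

        // Each implementation fills "next" from "current" in its own way.
        public abstract void Step();

        protected void SwapBuffers()
        {
            byte[] tmp = current;
            current = next;
            next = tmp;
        }
    }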

Implementation smells

All the source code is available for download, of course. There are no fancy options for verifying the results or even turning off rendering - just comment out the line in Program.BackgroundLoop which calls Render.

It would be fairly easy to make ParallelFastInteriorBoard derive from SimpleFastInteriorBoard, and ParallelActiveBlockBoard derive from ActiveBlockBoard, but I've kept the implementations completely separate for the moment. That means a huge amount of duplication, of course, including a nested class in each of the "active block" implementations - the nested classes are exactly the same. Oh, and they use internal fields instead of properties. I wanted the fields to be read-only, so I couldn't use automatic properties - but I didn't want to go to the hassle of having explicitly declared fields and separate properties. Trust me, I wouldn't do this in production code. (There were some other internal fields, but they've mercifully been converted into properties now.)
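
For anyone wondering what the fields-vs-properties gripe is about, this is the corner C# 3 paints you into:

    // An automatic property can't be made genuinely read-only - the closest
    // you get is one that's privately settable:
    public int Width { get; private set; }

    // A read-only field gets real compiler enforcement, but then it's a
    // field rather than a property:
    internal readonly int width;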

SimpleBoard

This is about as simple as you can get. It fetches each cell's current value "carefully" (i.e. coping with wrapping) regardless of where it is on the board. It works, but it's almost painfully slow.
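
The heart of it looks roughly like this (simplified, and slotting into the sketch base class above):

    // Every neighbour access goes through a wrapping helper.
    private byte GetWrapped(int x, int y)
    {
        x = (x + Width) % Width;
        y = (y + Height) % Height;
        return current[y * Width + x];
    }

    public override void Step()
    {
        for (int y = 0; y < Height; y++)
        {
            for (int x = 0; x < Width; x++)
            {
                int neighbours =
                    GetWrapped(x - 1, y - 1) + GetWrapped(x, y - 1) + GetWrapped(x + 1, y - 1) +
                    GetWrapped(x - 1, y) + GetWrapped(x + 1, y) +
                    GetWrapped(x - 1, y + 1) + GetWrapped(x, y + 1) + GetWrapped(x + 1, y + 1);
                bool alive = current[y * Width + x] == 1;
                // Standard Life rules: birth on 3 neighbours, survival on 2 or 3.
                next[y * Width + x] = (byte) (neighbours == 3 || (alive && neighbours == 2) ? 1 : 0);
            }
        }
        SwapBuffers();
    }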

SimpleFastInteriorBoard

This is just an optimisation of SimpleBoard. We fetch the borders of the board "slowly" as per SimpleBoard, but once we're on the interior of the board (any cell that's not on an edge) we know we won't need to worry about wrapping, so we can just use a fixed set of array offsets relative to the offset of the cell that's being calculated.

This ends up being significantly faster, although it could still be optimised further. In the current code every cell is tested for whether it's an interior cell or not. There's a slight optimisation in remembering the answer to "am I looking at the top or bottom row?" for a whole row at a time, but a lot of tests could be avoided entirely by calculating just the edges and then the interior (or vice versa).
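
The fast path over the interior is roughly this (again simplified):

    // For interior cells the eight neighbours sit at fixed offsets from the
    // cell's own index, so no wrapping checks are needed.
    int w = Width;
    int[] offsets = { -w - 1, -w, -w + 1, -1, 1, w - 1, w, w + 1 };

    for (int y = 1; y < Height - 1; y++)
    {
        for (int x = 1; x < w - 1; x++)
        {
            int index = y * w + x;
            int neighbours = 0;
            foreach (int offset in offsets)
            {
                neighbours += current[index + offset];
            }
            next[index] = (byte) (neighbours == 3 ||
                (current[index] == 1 && neighbours == 2) ? 1 : 0);
        }
    }
    // The edge cells still go through the slow wrapping path from SimpleBoard.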

ParallelFastInteriorBoard

Behold the awesome power of Parallel.For. Parallelising SimpleFastInteriorBoard literally took about a minute. I haven't seen any other toolkit rival this in terms of ease of use. Admittedly I haven't done much embarrassingly parallel work in other toolkits, but even so it's impressive. It'll be interesting to see how easy it is to use for complex problems as well as simple ones.
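
To give an idea of just how little work was involved, the change is essentially this, with ProcessRow standing in for the per-row logic (in the current CTP the Parallel class lives in the System.Threading namespace):

    // Before: the sequential loop over rows.
    for (int y = 0; y < Height; y++)
    {
        ProcessRow(y);
    }

    // After: the same loop, parallelised. Each row only reads from the
    // previous generation and writes its own slice of the next one, so no
    // extra locking is needed.
    Parallel.For(0, Height, y => ProcessRow(y));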

If you look at the results, you'll see this is the first point at which there's a difference between rendering and not. As the speed of the calculation improves the constant time taken for rendering becomes more significant.

ActiveBlockBoard

This is a departure in terms of implementation. We still use the same backing store (one big array with one byte per cell) but the board is logically broken up into blocks of 25x25 cells. (The size is hard-coded, but only as two constants - they can easily be changed. I haven't investigated the effect of different block sizes thoroughly, but a few ad hoc tests suggest this is a reasonable sweet spot.) When considering a block in generation n, the changes (if any) in generation n-1 are considered. If the block itself changed, it may change again. If the relevant borders of any of its neighbours changed (including the corners of diagonally neighbouring blocks), the block may change. No other changes in the board can affect the block, so if none of these conditions are met, the contents of generation n-1 can simply be copied to generation n for that block.

This is obviously a massive saving when there are large areas of stable board, either dead or with stable patterns (such as 2x2 live blocks). It doesn't help with blinkers (3x1 lines of cells which oscillate between vertical and horizontal alignments) or anything similar, but it's still a big optimisation. It does mean a bit more house-keeping in terms of remembering what changed each generation (anywhere in the block, each of the four sides, and each of the four corners - see the sketch below), but that pales into insignificance when you can skip the work for a lot of blocks.
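
In sketch form, the per-block bookkeeping looks something like this (the names are illustrative, but the real nested class tracks the same information):

    class Block
    {
        // Every block has all eight neighbours, because the board wraps.
        public Block Above, Below, Left, Right;
        public Block AboveLeft, AboveRight, BelowLeft, BelowRight;

        // What changed in this block during the previous generation.
        public bool AnyChange;
        public bool TopChanged, BottomChanged, LeftChanged, RightChanged;
        public bool TopLeftChanged, TopRightChanged,
                    BottomLeftChanged, BottomRightChanged;

        // A block only needs recalculating if it changed itself, or if a
        // relevant edge or corner of a neighbouring block changed; otherwise
        // its previous contents can just be copied forward.
        public bool NeedsRecalculation()
        {
            return AnyChange
                || Above.BottomChanged || Below.TopChanged
                || Left.RightChanged || Right.LeftChanged
                || AboveLeft.BottomRightChanged || AboveRight.BottomLeftChanged
                || BelowLeft.TopRightChanged || BelowRight.TopLeftChanged;
        }
    }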

Again, my implementation is relatively simple - but it still took a lot of work to get right! The code is much more complicated than SimpleFastInteriorBoard, and indeed pretty much all the code from SFIB is used in ActiveBlockBoard. Again, I'm sure there are optimisations that could be made to it, and alternative strategies for implementing it. For instance, I did toy with some code which didn't keep track of what had changed in the block, but instead pushed changes to relevant neighbouring blocks, explicitly saying, "You need to recalculate next time!" There wasn't much difference in speed, however, and it needed an extra setup stage in order to make the code understandable, so I've removed it.

The performance improvement here is staggering - it's over 20 times faster than the single-threaded SimpleFastInteriorBoard (with rendering turned off). At first the improvement seemed much smaller - until I realised that I'd actually moved the bottleneck to the rendering, and re-benchmarked with rendering disabled.

ParallelActiveBlockBoard

Exactly the same implementation as ActiveBlockBoard, but using a Parallel.ForEach loop over the blocks. Again, blissfully easy to implement, and very effective. It didn't show the same near-doubling that parallelisation gave SimpleFastInteriorBoard (with rendering turned off), but I suspect that's because the housekeeping overhead of parallelisation starts to matter more at the frame rates shown in the results. It's still impressive stuff though.
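
Again in sketch form, with RecalculateBlock and CopyBlockForward as stand-ins for the real methods:

    // The parallel version distributes whole blocks rather than rows.
    Parallel.ForEach(blocks, block =>
    {
        if (block.NeedsRecalculation())
        {
            RecalculateBlock(block);
        }
        else
        {
            CopyBlockForward(block);   // generation n := generation n-1
        }
    });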

Results

Implementation              FPS with rendering   FPS without rendering
SimpleBoard                                  4                       4
SimpleFastInteriorBoard                     15                      15
ParallelFastInteriorBoard                   23                      29
ActiveBlockBoard                            78                     326
ParallelActiveBlockBoard                    72                     508

Conclusions

  • The Parallel Extensions library rocks my world. Massive, massive kudos to Joe Duffy and the team.
  • I have no idea why ParallelActiveBlockBoard is slower than ActiveBlockBoard when rendering is enabled. This was repeatable (with slightly different results each time, but the same trend).
  • Always be aware that reducing one bottleneck may mean something else becomes your next bottleneck - at first I thought that ActiveBlockBoard was "only" 5 times faster than SimpleFastInteriorBoard; it was only when parallelisation showed no improvement (just the opposite) that I started turning rendering off.
  • Changing the approach can have more pronounced effects than adding cores - moving to the "active block" model gave a massive improvement over brute force.
  • Micro-optimisation has its place, when highly targeted and backed up with data - arguably the "fast interior" is a micro-optimisation, but again the improvement is quite dramatic.
  • Benchmarks are fun, but shouldn't be taken to be indicative of anything else.
  • Did I mention that Parallel Extensions rocks?
Published Sun, Jun 1 2008 20:22 by skeet

Comments

# re: More parallelisation fun: Conway's Game of Life

Okay, read through your two recent parallelism entries.  Interesting stuff.  But how do you hold down your Google job and still have time left for this stuff?  Are there parallel Jon Skeets too?

Anyway...not much of real value to add, but figured I'd comment anyway on a few things (lack of value never stopped me before):

I'm particularly impressed that using the Parallel Extensions results in *better* performance than a hand-coded version of what one would otherwise guess would be an identical implementation. Specifically, in your Mandelbrot test, comparing the "MultiRow..." versions with the Parallel.For and PLINQ versions (the "generator" versions notwithstanding :) ). It makes me wonder if the Parallel Extensions uses a faster synchronization mechanism than the Monitor class you're using (for example, rather than using a queue, perhaps they just use an interlocked-increment to iterate through the rows).
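
Something along these lines, I mean - purely speculative on my part, and I have no idea what they actually do internally:

    // Each worker claims the next row with an atomic increment instead of
    // taking a lock around a shared queue. nextRow and ProcessRow are
    // illustrative only.
    int nextRow = -1;

    void Worker()
    {
        while (true)
        {
            int row = Interlocked.Increment(ref nextRow);
            if (row >= Height)
            {
                break;
            }
            ProcessRow(row);
        }
    }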

Which doesn't negate the point of using the Parallel Extensions.  If anything, it illustrates just how important it is to let the experts do the stuff they're good at, rather than trying to become an expert in every field applicable to one's program implementation.  Of course, the Parallel Extensions stuff also carries the same benefit all frameworks have, which is that performance improvements wind up granted to client code "for free" when they appear.

As far as this comment of yours (why no "preview" button so I can see whether my attempts to use HTML formatting will work? :) ):

<blockquote>I have no idea why ParallelActiveBlockBoard is slower than ActiveBlockBoard when rendering is enabled. This was repeatable (with slightly different results each time, but the same trend).</blockquote>

I wonder if you're getting hit with a refresh rate effect.  I didn't actually look at the code, so I'm completely ignorant as to how you're rendering.  But sometimes subtle timing differences can create modalities where code that's otherwise faster winds up hit harder by display synchronization.  Sorry for the hand-waving...no concrete examples, it's late here, just thought I'd toss that out in case it's of interest.  :)

And:

<blockquote>Changing the approach can have more pronounced effects than adding cores - moving to the "active block" model gave a massive improvement over brute force.</blockquote>

This is basically the same point Alun was making with the Mandelbrot set (you did acknowledge it there too, so I'm not being accusatory here :) ).  I remember seeing a Mac Mandelbrot screen-saver (1990, 1991?) that took advantage of the characteristic that Alun was describing.  It chunked the coordinate space into sections with relative sizes proportional to the Mandelbrot set's own "lobes", such that initially the set was divided horizontally in half, but vertically so that the larger lobe was entirely separated from the rest.

The slowest computations are, of course, where the points are considered "in" and thus reach the maximum number of iterations. But those are also the points that can be identified without computing them, whenever they're entirely enclosed by other points that are "in". It was interesting watching this screen-saver manage to lop off entire parts of the problem just by scanning rectangles enclosing potential "prunable" areas.
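
In rough C# terms, the test is something like this (hypothetical helpers, reconstructed from my hazy memory of how it looked - and strictly a heuristic at pixel resolution, not a mathematical guarantee):

    // If every pixel on the border of a rectangle reaches the maximum
    // iteration count ("in" the set), the interior can be filled in
    // without computing it. IsInSet computes (and caches) one pixel.
    bool BorderAllIn(int left, int top, int right, int bottom)
    {
        for (int x = left; x <= right; x++)
        {
            if (!IsInSet(x, top) || !IsInSet(x, bottom)) return false;
        }
        for (int y = top + 1; y < bottom; y++)
        {
            if (!IsInSet(left, y) || !IsInSet(right, y)) return false;
        }
        return true;
    }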

Anyway, that's a long way of saying that, yes...parallelism is great, and having an API that allows one to very easily take advantage of parallelism, especially in a high-performance way, is also great. But it's just as important to keep in mind the specifics of the problem.

And on that topic: I feel like the "interior vs. edge cell" optimization is right on the border between "micro" and simply taking advantage of some basic aspect of the problem. I guess I'd have to give the nod to "micro", but only because it's something that comes out of the inherent nature of the *implementation* rather than the problem itself (e.g. on a computer where you can control the array topology and connect the edges, like the Goodyear MPP I worked on in the late 80's, that optimization would be pointless :) ). But it's not nearly as "micro" as some of the crazy stuff I've seen (and tried, mostly unsuccessfully :) ) in the past.

Anyway, thanks for sharing. I look forward to exploring the Parallel Extensions in the future.

Monday, June 02, 2008 3:00 AM by Peter Duniho

# re: More parallelisation fun: Conway's Game of Life

Very, very nice work.  Keep it up.

As promised:

www.microsoft.com/.../details.aspx

Monday, June 02, 2008 2:15 PM by Joe Duffy