The Development Jungle

Hacking a path through the complexities of software

Optimising Mandelbrot

with 3 comments

In my previous post I showed how I created a Mandelbrot fractal renderer in Clojure, however its performance left a lot to be desired. Optimising code is actually a very good way to learn more about a new language as it teaches you what is going on under the hood. In this post, I will go through the steps I took to drastically improve the application’s performance.

Inspecting the Code

The first step when optimising code (after not optimising of course) is to inspect the code’s logic and find quick fixes to any obvious potential performance areas. The logic of the Mandelbrot code however is fairly easy to reason about as we have built it up from the mathematical formula. Also, looking for obvious performance issues is difficult in a language you are trying to learn, so it’s time to break out the tools.

Profiling

First of all, let’s see how long it takes to render our fractal with the current implementation:

user=> (time (my-mandelbrot))
"Elapsed time: 29340.692742 msecs"
nil

Note, I have undone the premature optimisation I made last time and made the rendering single threaded by replacing pmap with map. This is important when profiling as it ensures we can properly determine which parts of the code are taking the longest. On a quad core cpu, I found that using pmap gave only a 2x speed increase which shows there is quite a lot of overhead in Clojure when processing on multiple threads.

Now at this stage, we have no idea what is taking so long so let’s run the app through a profiler. As Clojure is hosted on the JVM, it allows us to take advantage of many tools that would normally be used for Java applications. I attached the VisualVM profiling tool (which now comes with Java 6) to my application and then called my-mandelbrot again from the REPL. This is what I got after running for a few minutes:

What strikes you when digging through a Clojure call tree is just how deep it can go. It can also be difficult to relate the Java method calls we see back to the original Clojure code. Here I have expanded the calls made within the render function. You can see the two main consumers of CPU time are calls to count and vec. This is surprising as if you have a look at the original render function code, there’s not a call to either the count or vec functions to be seen:

(defn render [xstart ystart xsize ysize width height max-iters wr]
  (dorun
    (map (fn [pixel]
            (let [[x y] pixel]
              (.setPixel wr x y
                (int-array (coord-colour
                  (get-coord (double x) (double y) xstart ystart xsize ysize width height)
                  max-iters)))))
           (get-pixels width height))))

In fact, count is called from within the coord-colour function:

(defn coord-colour
  "Returns a colour for which to draw the given coordinate. If the coordinate
  is within the mandelbrot set, black is returned. Otherwise, a colour within
  a gradient is given based on the number of iterations of the mandelbrot set
  that have been evaluated."
  [[xcoord ycoord] max-iters]
  (let [num-iters (count (take max-iters (take-while #(<= (mag %) 4) (mandelformula xcoord ycoord))))]
    (if (= max-iters num-iters)
      *set-colour*
      (iter-colour num-iters max-iters))))

The call to vec is deeper still within the calls to iter-colour and grad-colour:

(defn grad-colour
  "Returns the colour that is the given fraction of the way between
  the first and second colours given. Returns as a vector of three
  integers between 0 and 255."
  [colA colB frac]
  (vec (map #(+ (* frac (- %2 %1)) %1) colA colB)))

(defn iter-colour
  "Returns the colour needed to paint a point with the given number
  of iterations"
  [num-iters max-iters]
  (grad-colour *grad-colour-a* *grad-colour-b*
    (/ (double num-iters) max-iters)))

The reason we see the count and vec calls at the level of the render function is that Clojure (or perhaps the hotspot) appears to inline certain function calls into the calling function. You can see further evidence of this in the above screenshot if you have a look at the divide, add and multiply methods. These are the operations performed by the get-coord function:

(defn get-coord
  "Returns the coordinates of the given pixel in the complex plane"
  [x y xstart ystart xsize ysize width height]
  [(+ xstart (* (/ x width) xsize))
   (+ ystart (* (/ y height) ysize))])

You should also see that at the time of the snapshot I took above, the number of invocations of each of the mathematical operations was around 67k which is twice the number of invocations of the count function which makes sense as you can see they are called once for each of the x and y coordinates.

Let’s dig deeper into the count function.

As you expand the call tree, and if you ignore the clojure.lang methods, you’ll see calls to take, take-while, iterate and finally mandelformula. Here’s another mystery of the Clojure call stack: we have a call to iterate before the call to mandelformula and yet in our code we can see iterate is called within the mandelformula function.

(defn mandelformula [xcoord ycoord]
  "Returns an infinite sequence of vectors containing the values of
  successive iterations of the mandelbrot formula, given a point on
  the complex plane."
  (iterate
    #(vec (map + (mandel %) [xcoord ycoord]))
    [xcoord ycoord]))

As far as I can tell, the reason for this is that mandelformula returns a lazy sequence which is not actually evaluated until it is needed. So iterate is used to get the next value in the sequence, while mandelformula actually does the job of calculating what the next value should be.

Drilling down into the mandelformula function we open up a whole can of method calls and it gets suddenly very difficult to map back to the original Clojure code. However we can see there seem to be a large number of sequences created and calls to methods in the clojure.lang.Numbers class taking Objects as their parameters. If we could get the formula to only use primitive types, this could potentially remove a lot of object creation and garbage collection. However, before we start re-working the code, let’s look at another profiling technique available to us in Clojure.

Using Clojure Contrib’s Profile Macros

Stuart Sierra has made a vast number of contributions to Clojure including the profile library which provides two macros prof and profile to measure the execution time of any block of code. Clojure’s macro system makes what would normally be a cumbersome task in other languages extremely easy. Let’s put profiling in our coord-colour function:

(defn coord-colour
  [[xcoord ycoord] max-iters]
  (let [num-iters (prof :mandel (count (take max-iters (take-while #(<= (mag %) 4) (mandelformula xcoord ycoord)))))]
    (prof :colour (if (= max-iters num-iters)
      *set-colour*
      (iter-colour num-iters max-iters)))))

We simply wrap the block of code we want to profile around a call to the prof macro and specify a keyword to identify that block of code. Calling the profile macro will then summarise the time taken to evaluate each call to those prof blocks and print a bunch of statistics.

user=> (time (profile (my-mandelbrot)))
  Name      mean       min       max     count       sum
colour      4875      1117   1091200    300000  1462510112
mandel     71729      2514 100182286    300000 21518722246
"Elapsed time: 31456.673404 msecs"

Here we can see that the total time taken by the mandel section of the coord-colour function is around 21 seconds (the values above are in nanoseconds) which gives it 68% of the total running time. The time taken to calculate the gradient colour is about 5% of the total running time. These are quite different statistics from what we saw with VisualVM which gave figures of 60% for the count function and 20% for the calls to vec. The discrepancy is most likely due to the fact that the program did not run to completion when I profiled it with VisualVM. The time taken by the count function depends very much on the number of iterations it executes and during the first 25% of the render, most of its calls finish within 4-5 iterations.

In any case, now that we know the source of the slow performance, let’s turn our attention to optimisation.

Back to Primitives

If we want to eliminate boxed numbers while calculating each term of the Mandelbrot formula, then unfortunately that means we will also have to stop being lazy. It’s not possible to create a lazy sequence of primitives (just yet). This means we will need to calculate all the terms of the Mandelbrot formula in a single function before returning the number of iterations taken.

(defn mandelformula [x0 y0 max-iters]
  "Applies the mandelbrot formula until max-iters iterations
  are reached, or the magnitude of Z exceeds 2"
  (let [x0 (double x0)
        y0 (double y0)
        max-iters (int max-iters)]
    (loop [x (double x0)
           y (double y0)
           n (int 0)]
      (if (== n max-iters)
        n
        (let [mag (+ (* x x) (* y y))]
          (if (>= mag (double 4))
            n
            (let [new-x (+ x0 (- (* x x) (* y y)))
                  new-y (+ y0 (* (double 2) (* x y)))]
              (recur new-x new-y (inc n)))))))))

Here is the newly improved function. Note how I explicitly coerce the function parameters to doubles using a let. The intermediate values for each iteration are now defined in a loop and again explicitly coerced. It can be tricky in Clojure to force primitives – there’s no static type checking as in Java of course – but Clojure will throw an exception if you try to pass non-primitive values to a loop that uses coerced parameters. This helps detect the times when Clojure magically autoboxes your primitives behind your back. (One place where I tripped up was thinking that mathematical operators that take multiple arguments would return a primitive when passed primitives, in fact, (* (int 1) (int 2) (int 3)) will return an Integer whereas (* (int 1) (int 2)) returns an int. This is why the calculation of new-y above involves two calls to *).

In the above function, I’m taking care of calculating all the terms of the mandelbrot formula up to the point where z > 2 or max-iters is reached. This is a bit more than the original mandelformula function so the coord-colour function simplifies to:

(defn coord-colour
  "Returns a colour for which to draw the given coordinate. If the coordinate
  is within the mandelbrot set, black is returned. Otherwise, a colour within
  a gradient is given based on the number of iterations of the mandelbrot set
  that have been evaluated."
  [[xcoord ycoord] max-iters]
  (let [num-iters (mandelformula xcoord ycoord max-iters)]
    (if (= max-iters num-iters)
      *set-colour*
      (iter-colour num-iters max-iters)
      )))

Let’s see what kind of performance we get now.

user=> (time (profile (my-mandelbrot)))
  Name      mean       min       max     count       sum
colour      5179      1117   1593220    300000  1553876060
mandel      1787      1396    137448    300000  536248035
"Elapsed time: 10177.274766 msecs"

Wow, that’s a 40x speed improvement for the mandel part of our code. This shows how significant keeping object allocations to a minimum can be. However, the two parts of the code we’re profiling only make up about 2 seconds out of a total of 10 seconds of runtime. Let’s move our profiling to the render function:

(defn render [xstart ystart xsize ysize width height max-iters wr]
  (dorun
    (map (fn [pixel]
            (let [[x y] pixel]
              (prof :setPixel (.setPixel wr x y
                (prof :int-array (int-array 
                  (prof :coord-colour (coord-colour
                    (get-coord (double x) (double y) xstart ystart xsize ysize width height)
                    max-iters))))))))
           (get-pixels width height))))

And run the render again:

user=> (time (profile (my-mandelbrot)))
        Name      mean       min       max     count       sum
coord-colour      6632      3073   1690997    300000  1989613029
   int-array     11402      7263   1706641    300000  3420617714
    setPixel     30688     21790   2191620    300000  9206612546
"Elapsed time: 11283.777407 msecs"

You can see that setPixel is most costly function, but this is not really a surprise given the other two calls are nested within it. However, even taking the runtime of the nested calls into account, the .setPixel call is taking nearly 6 out of the total 11 seconds (the reason we have an extra 1s of runtime after this run is most likely due to the overhead of profiling another block of code).

Removing Reflection

We can see that .setPixel is called on a binding called wr but Clojure (nor us) has no idea what type wr might be. In fact Clojure has to use reflection to determine which method .setPixel refers to. Let’s turn on reflection warning and load the program again:

Reflection warning, D:\dev\clojure\...\mandelbrot.clj:87 - call to setPixel can't be resolved.
Reflection warning, D:\dev\clojure\...\mandelbrot.clj:87 - call to setPixel can't be resolved.
Reflection warning, D:\dev\clojure\...\mandelbrot.clj:99 - call to drawImage can't be resolved.
Reflection warning, D:\dev\clojure\...\mandelbrot.clj:105 - call to setPreferredSize can't be resolved.
Reflection warning, D:\dev\clojure\...\mandelbrot.clj:106 - call to add can't be resolved.
Reflection warning, D:\dev\clojure\...\mandelbrot.clj:117 - reference to field getRaster can't be resolved.
Reflection warning, D:\dev\clojure\...\mandelbrot.clj:122 - reference to field repaint can't be resolved.
1:1 user=> 

We can see that with reflection warning turned on, Clojure complains that it cannot resolve the call to setPixel. We can fix this of course with a type hint, however we must also remember to coerce the parameters into the types expected by the setPixel method:

(defn render [xstart ystart xsize ysize width height max-iters #^WritableRaster wr]
  (dorun
    (map (fn [pixel]
            (let [[x y] pixel]
              (prof :setPixel (.setPixel wr (int x) (int y)
                (prof :int-array (int-array 
                  (prof :coord-colour (coord-colour
                    (get-coord (double x) (double y) xstart ystart xsize ysize width height)
                    max-iters))))))))
           (get-pixels width height))))

Running the program again we notice the reflection warning has gone (the others remain but we don’t care about them as their calls are not made within a loop), and we get a fairly significant speed up:

user=> (time (profile (my-mandelbrot)))
        Name      mean       min       max     count       sum
coord-colour      6770      3072  108868865    300000  2031060305
   int-array     12263      6704  221714492    300000  3679041832
    setPixel     16247     10057  221746898    300000  4874399625
"Elapsed time: 6739.067116 msecs"

This now means that the setPixel, int-array and coord-colour sections now take 17%, 25% and 30% of the total elapsed time respectively. These proportions seem pretty even so it’s possible that our profiling representing a significant portion of that elapsed time. Let’s run again, this time without making the call to the profile macro:

user=> (time (my-mandelbrot))
"Elapsed time: 1668.624085 msecs"

That’s quite a big drop and it shows how significant the overhead of profiling can be but I’m hoping that this can be reduced with optimisations to the profiling macros themselves in the future.

We saw earlier how the section of code for calculating a colour gradient was taking about 75% of the time of the mandelformula function. It’s possible this function could be sped up with more primitive type arithmetic but let’s try a different, more functional method of optimisation.

Memoization

The iter-colour function takes a number between 1 and max-iter and returns a corresponding colour. The fact that there is only a small, fixed number of possible inputs to the function and the output is always the same for a given input, makes it a perfect candidate for memoization. We can simply redefine the function as a memoized version of itself:

user=> (def iter-colour (memoize iter-colour ))
#'user/iter-colour
user=> (time (my-mandelbrot))
"Elapsed time: 954.501759 msecs"

We have now managed to improve overall performance by over 30 times. It’s possible we could repeat these techniques such as inlining the get-pixels function or making the get-coord function use primitives to get a further performance boost. However, as cgrand says, an optimisation job is never done and for now I believe the performance is “good enough”.

The full optimised code can be found on github. Please feel free to submit your own performance optimisations to improve the code even further :)

Conclusion

We’ve seen how to use profiling tools to peer beneath Clojure’s dynamic lispy surface and used this knowledge to dramatically improve the performance of our fractal renderer by taking advantage of function inlining, primitive coersions, elimination of reflection and memoization. Hopefully these techniques will be useful to you when improving the performance of your own Clojure programs!

References

Inspiration on how to write a Mandelbrot fractal renderer the ‘right’ way: http://briancarper.net/tag/mandelbrot
Performance tips on Clojure: http://devlog.bigmonachus.org/2009/03/performance-tips-for-clojure.html

About these ads

Written by alex

November 24, 2009 at 22:55

Posted in programming-languages

Tagged with ,

3 Responses

Subscribe to comments with RSS.

  1. This is a very good post! “Here’s another mystery of the Clojure call stack: we have a call to iterate before the call to mandelformula”
    You should notice that the classes names are iterate__5114$fn__5116 and mandelformula__28$fn__30 and not simply iterate__XXX or mandelformula__YYY: these are the names of anonymous functions inside of iterate and inside of mandelformula (and, true, each lazy-seq uses an anonymous function to delay evaluation).

    Christophe Grand

    November 25, 2009 at 06:27

  2. [...] Optimising Mandelbrot « The Development Jungle (tags: clojure optimization) [...]

  3. One comment: Replace the

    (dorun (map (fn [pixel] …) (get-pixels width height)))

    with a

    (doseq [pixel (get-pixels width height)] …).

    The latter doesn’t build a lazy sequence and is much more idiomatic.

    the-kenny

    February 26, 2012 at 16:23


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: