In my previous post I showed how I created a Mandelbrot fractal renderer in Clojure, however its performance left a lot to be desired. Optimising code is actually a very good way to learn more about a new language as it teaches you what is going on under the hood. In this post, I will go through the steps I took to drastically improve the application’s performance.
Inspecting the Code
The first step when optimising code (after not optimising of course) is to inspect the code’s logic and find quick fixes to any obvious potential performance areas. The logic of the Mandelbrot code however is fairly easy to reason about as we have built it up from the mathematical formula. Also, looking for obvious performance issues is difficult in a language you are trying to learn, so it’s time to break out the tools.
First of all, let’s see how long it takes to render our fractal with the current implementation:
user=> (time (my-mandelbrot)) "Elapsed time: 29340.692742 msecs" nil
Note, I have undone the premature optimisation I made last time and made the rendering single threaded by replacing
map. This is important when profiling as it ensures we can properly determine which parts of the code are taking the longest. On a quad core cpu, I found that using
pmap gave only a 2x speed increase which shows there is quite a lot of overhead in Clojure when processing on multiple threads.
Now at this stage, we have no idea what is taking so long so let’s run the app through a profiler. As Clojure is hosted on the JVM, it allows us to take advantage of many tools that would normally be used for Java applications. I attached the VisualVM profiling tool (which now comes with Java 6) to my application and then called
my-mandelbrot again from the REPL. This is what I got after running for a few minutes:
What strikes you when digging through a Clojure call tree is just how deep it can go. It can also be difficult to relate the Java method calls we see back to the original Clojure code. Here I have expanded the calls made within the
render function. You can see the two main consumers of CPU time are calls to
vec. This is surprising as if you have a look at the original
render function code, there’s not a call to either the
vec functions to be seen:
(defn render [xstart ystart xsize ysize width height max-iters wr] (dorun (map (fn [pixel] (let [[x y] pixel] (.setPixel wr x y (int-array (coord-colour (get-coord (double x) (double y) xstart ystart xsize ysize width height) max-iters))))) (get-pixels width height))))
count is called from within the
(defn coord-colour "Returns a colour for which to draw the given coordinate. If the coordinate is within the mandelbrot set, black is returned. Otherwise, a colour within a gradient is given based on the number of iterations of the mandelbrot set that have been evaluated." [[xcoord ycoord] max-iters] (let [num-iters (count (take max-iters (take-while #(<= (mag %) 4) (mandelformula xcoord ycoord))))] (if (= max-iters num-iters) *set-colour* (iter-colour num-iters max-iters))))
The call to
vec is deeper still within the calls to
(defn grad-colour "Returns the colour that is the given fraction of the way between the first and second colours given. Returns as a vector of three integers between 0 and 255." [colA colB frac] (vec (map #(+ (* frac (- %2 %1)) %1) colA colB))) (defn iter-colour "Returns the colour needed to paint a point with the given number of iterations" [num-iters max-iters] (grad-colour *grad-colour-a* *grad-colour-b* (/ (double num-iters) max-iters)))
The reason we see the
vec calls at the level of the
render function is that Clojure (or perhaps the hotspot) appears to inline certain function calls into the calling function. You can see further evidence of this in the above screenshot if you have a look at the
multiply methods. These are the operations performed by the
(defn get-coord "Returns the coordinates of the given pixel in the complex plane" [x y xstart ystart xsize ysize width height] [(+ xstart (* (/ x width) xsize)) (+ ystart (* (/ y height) ysize))])
You should also see that at the time of the snapshot I took above, the number of invocations of each of the mathematical operations was around 67k which is twice the number of invocations of the
count function which makes sense as you can see they are called once for each of the x and y coordinates.
Let’s dig deeper into the
As you expand the call tree, and if you ignore the
clojure.lang methods, you’ll see calls to
iterate and finally
mandelformula. Here’s another mystery of the Clojure call stack: we have a call to
iterate before the call to
mandelformula and yet in our code we can see iterate is called within the
(defn mandelformula [xcoord ycoord] "Returns an infinite sequence of vectors containing the values of successive iterations of the mandelbrot formula, given a point on the complex plane." (iterate #(vec (map + (mandel %) [xcoord ycoord])) [xcoord ycoord]))
As far as I can tell, the reason for this is that
mandelformula returns a lazy sequence which is not actually evaluated until it is needed. So
iterate is used to get the next value in the sequence, while
mandelformula actually does the job of calculating what the next value should be.
Drilling down into the
mandelformula function we open up a whole can of method calls and it gets suddenly very difficult to map back to the original Clojure code. However we can see there seem to be a large number of sequences created and calls to methods in the
clojure.lang.Numbers class taking
Objects as their parameters. If we could get the formula to only use primitive types, this could potentially remove a lot of object creation and garbage collection. However, before we start re-working the code, let’s look at another profiling technique available to us in Clojure.
Using Clojure Contrib’s Profile Macros
Stuart Sierra has made a vast number of contributions to Clojure including the profile library which provides two macros
profile to measure the execution time of any block of code. Clojure’s macro system makes what would normally be a cumbersome task in other languages extremely easy. Let’s put profiling in our
(defn coord-colour [[xcoord ycoord] max-iters] (let [num-iters (prof :mandel (count (take max-iters (take-while #(<= (mag %) 4) (mandelformula xcoord ycoord)))))] (prof :colour (if (= max-iters num-iters) *set-colour* (iter-colour num-iters max-iters)))))
We simply wrap the block of code we want to profile around a call to the
prof macro and specify a keyword to identify that block of code. Calling the
profile macro will then summarise the time taken to evaluate each call to those
prof blocks and print a bunch of statistics.
user=> (time (profile (my-mandelbrot))) Name mean min max count sum colour 4875 1117 1091200 300000 1462510112 mandel 71729 2514 100182286 300000 21518722246 "Elapsed time: 31456.673404 msecs"
Here we can see that the total time taken by the
mandel section of the
coord-colour function is around 21 seconds (the values above are in nanoseconds) which gives it 68% of the total running time. The time taken to calculate the gradient colour is about 5% of the total running time. These are quite different statistics from what we saw with VisualVM which gave figures of 60% for the
count function and 20% for the calls to
vec. The discrepancy is most likely due to the fact that the program did not run to completion when I profiled it with VisualVM. The time taken by the
count function depends very much on the number of iterations it executes and during the first 25% of the render, most of its calls finish within 4-5 iterations.
In any case, now that we know the source of the slow performance, let’s turn our attention to optimisation.
Back to Primitives
If we want to eliminate boxed numbers while calculating each term of the Mandelbrot formula, then unfortunately that means we will also have to stop being lazy. It’s not possible to create a lazy sequence of primitives (just yet). This means we will need to calculate all the terms of the Mandelbrot formula in a single function before returning the number of iterations taken.
(defn mandelformula [x0 y0 max-iters] "Applies the mandelbrot formula until max-iters iterations are reached, or the magnitude of Z exceeds 2" (let [x0 (double x0) y0 (double y0) max-iters (int max-iters)] (loop [x (double x0) y (double y0) n (int 0)] (if (== n max-iters) n (let [mag (+ (* x x) (* y y))] (if (>= mag (double 4)) n (let [new-x (+ x0 (- (* x x) (* y y))) new-y (+ y0 (* (double 2) (* x y)))] (recur new-x new-y (inc n)))))))))
Here is the newly improved function. Note how I explicitly coerce the function parameters to doubles using a
let. The intermediate values for each iteration are now defined in a
loop and again explicitly coerced. It can be tricky in Clojure to force primitives – there’s no static type checking as in Java of course – but Clojure will throw an exception if you try to pass non-primitive values to a loop that uses coerced parameters. This helps detect the times when Clojure magically autoboxes your primitives behind your back. (One place where I tripped up was thinking that mathematical operators that take multiple arguments would return a primitive when passed primitives, in fact, (* (int 1) (int 2) (int 3)) will return an Integer whereas (* (int 1) (int 2)) returns an
int. This is why the calculation of
new-y above involves two calls to *).
In the above function, I’m taking care of calculating all the terms of the mandelbrot formula up to the point where
z > 2 or
max-iters is reached. This is a bit more than the original
mandelformula function so the
coord-colour function simplifies to:
(defn coord-colour "Returns a colour for which to draw the given coordinate. If the coordinate is within the mandelbrot set, black is returned. Otherwise, a colour within a gradient is given based on the number of iterations of the mandelbrot set that have been evaluated." [[xcoord ycoord] max-iters] (let [num-iters (mandelformula xcoord ycoord max-iters)] (if (= max-iters num-iters) *set-colour* (iter-colour num-iters max-iters) )))
Let’s see what kind of performance we get now.
user=> (time (profile (my-mandelbrot))) Name mean min max count sum colour 5179 1117 1593220 300000 1553876060 mandel 1787 1396 137448 300000 536248035 "Elapsed time: 10177.274766 msecs"
Wow, that’s a 40x speed improvement for the
mandel part of our code. This shows how significant keeping object allocations to a minimum can be. However, the two parts of the code we’re profiling only make up about 2 seconds out of a total of 10 seconds of runtime. Let’s move our profiling to the
(defn render [xstart ystart xsize ysize width height max-iters wr] (dorun (map (fn [pixel] (let [[x y] pixel] (prof :setPixel (.setPixel wr x y (prof :int-array (int-array (prof :coord-colour (coord-colour (get-coord (double x) (double y) xstart ystart xsize ysize width height) max-iters)))))))) (get-pixels width height))))
And run the render again:
user=> (time (profile (my-mandelbrot))) Name mean min max count sum coord-colour 6632 3073 1690997 300000 1989613029 int-array 11402 7263 1706641 300000 3420617714 setPixel 30688 21790 2191620 300000 9206612546 "Elapsed time: 11283.777407 msecs"
You can see that
setPixel is most costly function, but this is not really a surprise given the other two calls are nested within it. However, even taking the runtime of the nested calls into account, the
.setPixel call is taking nearly 6 out of the total 11 seconds (the reason we have an extra 1s of runtime after this run is most likely due to the overhead of profiling another block of code).
We can see that
.setPixel is called on a binding called
wr but Clojure (nor us) has no idea what type wr might be. In fact Clojure has to use reflection to determine which method
.setPixel refers to. Let’s turn on reflection warning and load the program again:
Reflection warning, D:\dev\clojure\...\mandelbrot.clj:87 - call to setPixel can't be resolved. Reflection warning, D:\dev\clojure\...\mandelbrot.clj:87 - call to setPixel can't be resolved. Reflection warning, D:\dev\clojure\...\mandelbrot.clj:99 - call to drawImage can't be resolved. Reflection warning, D:\dev\clojure\...\mandelbrot.clj:105 - call to setPreferredSize can't be resolved. Reflection warning, D:\dev\clojure\...\mandelbrot.clj:106 - call to add can't be resolved. Reflection warning, D:\dev\clojure\...\mandelbrot.clj:117 - reference to field getRaster can't be resolved. Reflection warning, D:\dev\clojure\...\mandelbrot.clj:122 - reference to field repaint can't be resolved. 1:1 user=>
We can see that with reflection warning turned on, Clojure complains that it cannot resolve the call to
setPixel. We can fix this of course with a type hint, however we must also remember to coerce the parameters into the types expected by the
(defn render [xstart ystart xsize ysize width height max-iters #^WritableRaster wr] (dorun (map (fn [pixel] (let [[x y] pixel] (prof :setPixel (.setPixel wr (int x) (int y) (prof :int-array (int-array (prof :coord-colour (coord-colour (get-coord (double x) (double y) xstart ystart xsize ysize width height) max-iters)))))))) (get-pixels width height))))
Running the program again we notice the reflection warning has gone (the others remain but we don’t care about them as their calls are not made within a loop), and we get a fairly significant speed up:
user=> (time (profile (my-mandelbrot))) Name mean min max count sum coord-colour 6770 3072 108868865 300000 2031060305 int-array 12263 6704 221714492 300000 3679041832 setPixel 16247 10057 221746898 300000 4874399625 "Elapsed time: 6739.067116 msecs"
This now means that the
coord-colour sections now take 17%, 25% and 30% of the total elapsed time respectively. These proportions seem pretty even so it’s possible that our profiling representing a significant portion of that elapsed time. Let’s run again, this time without making the call to the
user=> (time (my-mandelbrot)) "Elapsed time: 1668.624085 msecs"
That’s quite a big drop and it shows how significant the overhead of profiling can be but I’m hoping that this can be reduced with optimisations to the profiling macros themselves in the future.
We saw earlier how the section of code for calculating a colour gradient was taking about 75% of the time of the
mandelformula function. It’s possible this function could be sped up with more primitive type arithmetic but let’s try a different, more functional method of optimisation.
iter-colour function takes a number between 1 and max-iter and returns a corresponding colour. The fact that there is only a small, fixed number of possible inputs to the function and the output is always the same for a given input, makes it a perfect candidate for memoization. We can simply redefine the function as a memoized version of itself:
user=> (def iter-colour (memoize iter-colour )) #'user/iter-colour user=> (time (my-mandelbrot)) "Elapsed time: 954.501759 msecs"
We have now managed to improve overall performance by over 30 times. It’s possible we could repeat these techniques such as inlining the get-pixels function or making the get-coord function use primitives to get a further performance boost. However, as cgrand says, an optimisation job is never done and for now I believe the performance is “good enough”.
The full optimised code can be found on github. Please feel free to submit your own performance optimisations to improve the code even further :)
We’ve seen how to use profiling tools to peer beneath Clojure’s dynamic lispy surface and used this knowledge to dramatically improve the performance of our fractal renderer by taking advantage of function inlining, primitive coersions, elimination of reflection and memoization. Hopefully these techniques will be useful to you when improving the performance of your own Clojure programs!
Inspiration on how to write a Mandelbrot fractal renderer the ‘right’ way: http://briancarper.net/tag/mandelbrot
Performance tips on Clojure: http://devlog.bigmonachus.org/2009/03/performance-tips-for-clojure.html