Friday, April 17, 2009

Performance Tuning Daylog for Hydra Cache M4

After almost 7 months (wow, can't believe it's been more than half a year already) of hard work, the Hydra Cache project is finally reaching its Milestone 4 - Feature Completion, and is very close to its public debut, v1.0 RC1. You can find the code snapshot, javadoc, and various reports on the official Project Wiki Site and the new Maven Project Portal. Hydra Cache was built so far without performance tuning in mind. Yes, you heard it right: performance tuning was not our focus at all throughout the development of the first 4 milestones. Our focus so far has been on code quality, testability, architecture, and maintainability. We intentionally picked techniques that make our code succinct and readable over high-performance ones. For example, we used:

HashCodeBuilder.reflectionHashCode(this);

in the hashCode() method instead of hand-crafted bit-shifting magic. Why? Because it's shorter and more readable. But... but it's using reflection! I know, but trust me, you can write much slower code (a few orders of magnitude slower) just a few blocks away with a simple log.debug(...), so why focus on this one? Plus, the longer you delay the optimization, the more options you will have to throw at a performance problem, if this one ever actually becomes a real bottleneck.
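For contrast, the hand-crafted alternative we deliberately skipped at this stage looks roughly like this (the id and name fields are made up purely for illustration, they are not from the actual Hydra Cache code):

@Override
public int hashCode() {
    // classic hand-rolled style: fold each significant field in with a prime multiplier
    int result = 17;
    result = 31 * result + (int) id;
    result = 31 * result + (name == null ? 0 : name.hashCode());
    return result;
}

Faster, sure, but it is longer, easier to get wrong, and has to be kept in sync by hand every time a field changes.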

* Thanks to my teammate David Dossot for constantly reminding me at the beginning of the project to hold off the nerd inside me that craves high-octane performance and fancy hash code algorithms.

But now, to reach RC1, one of the tasks is to actually optimize performance, since one of the goals of Hydra Cache is to provide a highly scalable and fast caching system. First things first: let's build a performance test toolkit so we have a consistent and controlled environment to benchmark Hydra. I picked Apache JMeter, an open source performance testing tool, as our testing framework, since I have worked with JMeter before and, of course, it's open source. Two JMeter samplers were created for the GET and PUT operations respectively, and a throughput test plan was designed to stress the server in an artificial manner so we can measure the throughput and the CPU usage under stress.

Since this is not a real performance test meant to measure throughput, but rather a test designed to identify performance bottlenecks, it was executed on my laptop (Intel Centrino Core Duo 2GHz with 2GB of RAM, running Sun JDK 1.6.0_13). Four local Hydra servers were launched with 64MB of heap memory each, and the Hydra space configuration for each server node is identical, with the following parameters: N=3, W=2, R=1. This means a PUT operation has to be successfully multicast and propagated so that at least W=2 servers (counting the node that received the request) store it before the PUT is considered successful, and eventually up to N=3 servers will hold the same replica in the space. A GET operation, on the other hand, is served by a single node without multi-server consolidation. This is a typical configuration that offers reliable writes and redundancy while keeping reads optimized for performance.
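The samplers themselves are thin wrappers around JMeter's Java sampler API. Here is a rough sketch of what the PUT one could look like; note that HydraClient below is a hypothetical stand-in for the real Hydra Cache client API, not the actual class.

import org.apache.jmeter.protocol.java.sampler.AbstractJavaSamplerClient;
import org.apache.jmeter.protocol.java.sampler.JavaSamplerContext;
import org.apache.jmeter.samplers.SampleResult;

// Sketch of a custom JMeter Java sampler for the PUT operation
public class HydraPutSampler extends AbstractJavaSamplerClient {

    // Hypothetical placeholder for the real Hydra Cache client API
    interface HydraClient {
        void put(String key, byte[] value) throws Exception;
    }

    private HydraClient client;

    @Override
    public void setupTest(JavaSamplerContext context) {
        // wire in the real Hydra client here; a no-op stub keeps this sketch self-contained
        client = new HydraClient() {
            public void put(String key, byte[] value) { }
        };
    }

    @Override
    public SampleResult runTest(JavaSamplerContext context) {
        SampleResult result = new SampleResult();
        result.setSampleLabel("Hydra PUT");
        result.sampleStart(); // start timing
        try {
            client.put("key-" + Thread.currentThread().getId(), new byte[128]);
            result.setSuccessful(true);
        } catch (Exception e) {
            result.setSuccessful(false);
        } finally {
            result.sampleEnd(); // stop timing
        }
        return result;
    }
}

Let's fire it up and see what we've got. Here is the performance with 10 test threads and 1000 writes/reads each.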


GET Throughput Chart



PUT Throughput Chart

49k reads/minute (roughly 800 reads/sec) and 9k writes/minute (150 writes/sec). This is not too bad, considering that up to this point performance optimization was not even attempted, or considered, during coding.

* I am going to stress here again that this is not a throughput test but rather a test specifically designed to pinpoint performance bottlenecks, so do not take the numbers here as the maximum throughput for Hydra Cache.

Now, to find out what's hogging all the CPU cycles, we need to do some profiling. Thanks to the fantastic ej-technologies JProfiler and their much appreciated support for open source efforts; here is the profiling result.


As you can see, the top two hotspots are the toString() and hashCode() methods on the Identity class. Aha, remember the reflection-based builder I showed you at the beginning of the post? It was used in both of these methods. But before you tell me "I told you so", I need to explain that the same reflection builder was used in almost every model class, and a premature optimization would have wasted energy hand-writing these methods for a lot of classes that are nowhere near the hotspots. Alright, enough stalling, let's roll up our sleeves and create a kick-ass, super-fast implementation then. But hold on a second, I remember the Identity class is a pretty special one. What? It is immutable! Really? That changes the whole picture. If it's immutable, then instead of computing the hashCode and toString every time, we can simply cache them and eliminate the CPU cost almost entirely. This shows another positive aspect of delaying optimization: the longer you delay it, the more narrowly you can pinpoint the scope, and therefore the more tailored the solution can be, which usually means more efficiency. If we had applied a general optimization strategy up-front, it would most likely have been a broad-stroke, hand-crafted hashCode and toString for every class, and we would probably have missed this caching opportunity since we would have been busy hand-coding 50 different hashCode and toString methods. A few lines of code... ta-da... done.
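In case you are curious, the caching trick boils down to something like the following sketch. This is not the actual Identity source; the value field is made up for illustration, but the idea is the same: since the object is immutable, the reflection-based results can be computed once and reused forever.

import org.apache.commons.lang.builder.HashCodeBuilder;
import org.apache.commons.lang.builder.ToStringBuilder;

// Simplified sketch of an immutable class caching its reflection-built results
public final class Identity {

    private final short value; // made-up field, for illustration only

    // transient so the reflection builders skip the cache fields themselves
    private final transient int cachedHashCode;
    private final transient String cachedToString;

    public Identity(short value) {
        this.value = value;
        // still the readable reflection builders, but invoked only once per instance
        this.cachedHashCode = HashCodeBuilder.reflectionHashCode(this);
        this.cachedToString = ToStringBuilder.reflectionToString(this);
    }

    @Override
    public int hashCode() {
        return cachedHashCode; // no reflection cost after construction
    }

    @Override
    public String toString() {
        return cachedToString;
    }
}

Let's profile it again.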


Alright, much better. Now the top 5 hotspots each contribute only 1% of the CPU time. Let's also fire up JMeter to see the improvement in action.

GET Throughput Chart


PUT Throughput Chart

36k reads/minute and 12k writes/minute. Hmm... much better write throughput and slightly lower read throughput. It's a bit strange, but considering the small sample size and all the other stuff running on my laptop, this kind of fluctuation is within the margin of error, and since I am not really measuring throughput, I don't care too much. Run all the unit test cases, nothing is broken, feeling good, check all the changes in. Now it's time to start the next iteration.

I took a second look at the hotspot list and some numbers immediately caught my attention: the invocation counts for the top 5 hotspots seem a bit higher than I expected. So what's causing all these invocations? Scanning through the code and the references, aha, the debug statements are the culprit this time. Although debug logging is turned off to mimic a production environment, the statements still get executed. We are using Log4j, so a regular debug statement looks like this:

log.debug("Something useful: " + object.getValue());

This statement will not produce any log output, but the string concatenation still happens and the getValue() method still gets executed. A common approach to address this kind of performance penalty is to add a code guard like this:

if (log.isDebugEnabled()) {
    log.debug("Something useful: " + object.getValue());
}

This way, when debug logging is turned off, none of that code actually gets executed. Once again, delaying the optimization gave me the opportunity to identify and guard only the expensive debug statements, without forcing everyone to follow this practice throughout the project, which is much cheaper and just as effective. A few lines of code and I am done; let's profile it again.


Great! 4 out of the top 5 hotspots were demoted to less than 1% of CPU time. Alright, time to look at it from a different angle now. Let's look at the aggregated CPU time per method.


Bingo! There is a big CPU hog: JGroupSpace.findSubstancesForLocalNode(). Hmm... makes sense, this is a pretty expensive computation and it happens every time a PUT or GET operation is invoked. After looking at the code, I realized there is already a layer of abstraction on top of it, but that layer currently does not have the capability to compute the substances for the local node, which is why everyone calls this method and computes everything from scratch at the space level. It makes sense to pull the computation up into the intermediate layer, since we can cache the result there while the space is stable and only recompute it when a member node leaves or joins the space; even better, the intermediate layer is already receiving notifications for this kind of membership change for other purposes anyway. Alright, sold! That's the plan. Start with some new test cases and refactoring, a few iterations later, done! All test cases pass, a miracle :) Quickly check them in.
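In spirit, the refactoring looks something like the sketch below. The Substance and Space types here are simplified placeholders for the real Hydra abstractions, and the facade class name is made up; the actual intermediate layer already receives the membership notifications, so it only has to drop its cached copy there.

import java.util.Collection;

// Simplified placeholders standing in for the real Hydra Cache abstractions
interface Substance { }

interface Space {
    Collection<Substance> findSubstancesForLocalNode(); // the expensive computation
}

// Hypothetical caching facade sitting in the intermediate layer
public class CachingSpaceFacade implements Space {

    private final Space delegate;

    private volatile Collection<Substance> localSubstances; // cached result

    public CachingSpaceFacade(Space delegate) {
        this.delegate = delegate;
    }

    // called on every GET/PUT; recomputes only if the cache has been invalidated
    public Collection<Substance> findSubstancesForLocalNode() {
        Collection<Substance> result = localSubstances;
        if (result == null) {
            result = delegate.findSubstancesForLocalNode();
            localSubstances = result;
        }
        return result;
    }

    // hooked into the existing membership-change notification
    public void onNodeJoinedOrLeft() {
        localSubstances = null; // force a recompute on the next lookup
    }
}

Profile it again...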



MultiplexMessageReceiver.receive() dropped from 25ms to 4ms for 4k invocations, not too shabby at all. Now let's put everything together and see it in action. Fire up JMeter!


GET Throughput Chart


PUT Throughput Chart

A whopping 81k reads/minute and 11k writes/minute, so overall reads are 2-3 times faster now and writes are about 20-30% faster. Not too bad for an afternoon session, eh? Fire off the integration test cases, nothing is broken, great, mission accomplished. My next step is to observe and profile the memory usage pattern, first of all to make sure there is no memory leak, and secondly to optimize the garbage collection behavior. Maybe it's also time to put Hydra in a real performance testing lab to find out the actual throughput ceiling.

Stay tuned and Hydra Cache RC1 will be available for download this summer.