Monday, September 29, 2008

Want Groovy?

Groovy, a dynamic JVM-powered language (JSR-241), has been gaining a lot of momentum and attention recently, especially since the arrival of Grails - a web application development framework built with Groovy and designed along the lines of Rails. Some people have even started to consider Groovy a better version of Java, which I don't personally agree with, but I do think Groovy is better suited for many tasks traditionally performed in Java, such as building DSLs, creating dynamic frameworks, and more.

Despite all the grooviness about Groovy, it is still pretty difficult for any organization to adopt this relatively young technology due to a classic chicken-and-egg dilemma. Before formal adoption in a corporate setting, most of us would like to try the language out, but without formal adoption in a real project it's almost impossible to really evaluate and learn the language. So what do we do? One effective way I have found to introduce Groovy into your organization is to start by writing your unit tests in Groovy. Because it's just test code, there is usually less red tape around it, and since it will never be deployed into a production environment it is typically a lot easier to get approval to try it out. With the help of Maven and the GMaven plugin it's actually quite easy to add some grooviness to your project.

Step 1 - Add GMaven plugin to your pom.xml



<plugin>
    <groupId>org.codehaus.groovy.maven</groupId>
    <artifactId>gmaven-plugin</artifactId>
    <executions>
        <execution>
            <goals>
                <goal>generateTestStubs</goal>
                <goal>testCompile</goal>
            </goals>
            <configuration>
                <sources>
                    <fileset>
                        <directory>${pom.basedir}/src/test/java</directory>
                        <includes>
                            <include>**/*.groovy</include>
                        </includes>
                    </fileset>
                </sources>
            </configuration>
        </execution>
    </executions>
</plugin>

This definition basically tells Maven to use the GMaven plugin to compile all *.groovy files under your standard src/test/java directory, which essentially allows you to write unit tests in both Java and Groovy.

Step 2 - Add Groovy runtime to your test classpath



<dependency>
    <groupId>org.codehaus.groovy.maven.runtime</groupId>
    <artifactId>gmaven-runtime-default</artifactId>
    <version>1.0-rc-3</version>
    <scope>test</scope>
</dependency>



Adding this dependency puts the Groovy runtime on your project's test classpath, and on your Eclipse classpath if you are using the eclipse:eclipse plugin.
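With the plugin and the runtime in place, you can drop a Groovy test right next to your Java ones. A minimal sketch (the test class and its contents are just an illustration) saved as src/test/java/StringReverseTest.groovy:

class StringReverseTest extends GroovyTestCase {
    // GroovyTestCase extends JUnit's TestCase, so once GMaven generates the
    // stubs and compiles this file, Surefire runs it like any other JUnit test
    void testGroovyReversesAString() {
        assert 'cba' == 'abc'.reverse()
    }
}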

Step 3 - Install Groovy Eclipse plugin (Optional)

If you are using Eclipse, you might find it useful to install the Groovy plugin for your IDE. Although this plugin still has a few rough edges, it allows you to run your Groovy-powered unit tests using GUnit, which I found to be a real productivity boost.

Groovy! Now you are free to write your unit and integration tests in either Groovy or Java, and hence free to try out the language and its features at your own pace. Have fun.

Friday, September 26, 2008

New Open Source Project - Hydra Cache

Inspired by our recent experience and events, a few of my friends and I started a new open source project called Hydra, aiming to provide the community with an open source implementation of Amazon Dynamo in Java. The project design is based only on papers and algorithms published in the public domain, mainly Werner Vogels' paper on Amazon Dynamo. Currently the project is in its design and prototype stage. If you are interested in this project, check out our Wiki at HydraCache.org, and if you would like to contribute as a developer please contact the project admins at the Hydra Project Page.

Tuesday, September 16, 2008

Consistent Hash based Distributed Data Storage

Challenge -

In an ultra-large enterprise application, such as an online e-commerce or online gaming site, the site is dealing with millions of users and thousands of transactions every second. To handle this kind of traffic, the sheer number of servers, routers, databases, and storage hardware makes hardware or network failure the norm instead of the exception. Despite the constant hardware failures in your system, your customers will not tolerate the slightest downtime; the more successful your system is, the more important it becomes to your clients, and the less happy they are when they experience an outage.

Solution -

To solve this challenge we need a highly available, decentralized, and high-performance data storage service shielding the application from this harsh reality and complexity. No exception should be thrown when hardware or network failure occurs, and the application code can always safely assume that the data storage service is available for reads and writes at any given point in time. This data storage service also needs to be evolutionarily scalable: since downtime is not acceptable, adding a new node and storage capacity should not require shutting down the entire system, and it should have only a limited impact on the service and its clients. A bonus side effect of this solution is that the distributed data storage can also act as a distributed cache to reduce hits on the persistent data store, such as a relational database.
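From the application's point of view, the whole service boils down to a very small contract. A hypothetical Groovy sketch (the interface and method names are mine, not the project's actual API):

interface DataStorageService {
    // a write never surfaces node failures to the caller; replication and
    // failover happen behind this interface
    void put(String key, byte[] value)

    // returns null if the key has never been stored
    byte[] get(String key)
}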

Design -

- Consistent Hash

In a large online application, the data that require this kind of high availability are usually data that can be identified by a primary key and stored as binary content, for example user sessions (session id), shopping carts (cart id), preferences (user id), etc. Because of this, a Consistent Hash based distributed storage solution was proposed. The Consistent Hash algorithm was initially introduced in Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web by David Karger in 1997. The key benefit of Consistent Hashing is that hashing any given key will always return the same value even when new slots are added or removed. The principle of this design can be illustrated using the following diagram. Imagine a virtual clock that represents an integer value from -2^31 to 2^31-1; each server (A, B and C) in the storage cluster has a hash value assigned, so each given key (K1 and K2) can only land somewhere between these server nodes on the clock. The algorithm searches the clock clockwise and picks the first server it encounters as the storage server for the given key, and because the hashing algorithm is consistent, any future put or get operation is guaranteed to be performed on the same node. Moreover, the consistent hash algorithm also limits the impact of adding or removing a node to its neighboring nodes instead of the entire cluster.
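A minimal Groovy sketch of that clockwise lookup, using a sorted map as the clock (the hash function here is just String.hashCode() as a stand-in for whatever the real implementation would use, and at least one server is assumed to be registered):

class ConsistentHashRing {
    // sorted map from position on the clock to the server sitting at that position
    private TreeMap<Integer, String> ring = new TreeMap<Integer, String>()

    void addServer(String server) {
        ring.put(hash(server), server)
    }

    String serverFor(String key) {
        // first server at or after the key's position, wrapping around to the start
        def entry = ring.ceilingEntry(hash(key)) ?: ring.firstEntry()
        entry.value
    }

    private int hash(String s) {
        s.hashCode()  // placeholder; a production node would use a stronger, well-mixed hash
    }
}

Hashing the same key always lands on the same server, and removing a server only moves that server's keys to its clockwise neighbour.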


Challenge 1: Non-Uniform Distribution

The first problem we need to solve is that the server hash values are most likely not uniformly distributed on the clock; as a result, server utilization will be skewed, which is hardly an ideal situation. To solve this problem we are planning to borrow the idea discussed in Werner's paper on Amazon Dynamo and create virtual nodes for the servers: when enough virtual nodes are created on the clock, a close-to-uniform distribution can be achieved.
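Continuing the sketch above, virtual nodes can be added by registering each physical server several times under synthetic names (the '#i' suffix is purely illustrative):

void addServer(String server, int virtualNodes) {
    // each physical server occupies several positions on the clock, which
    // smooths out the gaps between neighbouring servers
    (1..virtualNodes).each { i ->
        ring.put(hash(server + '#' + i), server)
    }
}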


Challenge 2: Availability and Replication

To provide high availability, the stored data needs to be replicated to multiple servers. Based on the algorithm we are employing, in the case of a server failure any data stored on that specific server automatically becomes the subjacent server's responsibility, as shown in the following diagram.

Therefore our replication strategy is quite simple: every node replicates the data it stores to its immediate subjacent neighboring server. You can also include more than two servers in the replication group for even higher availability, although in our project I believe a paired availability server group will provide the desired availability without introducing too much complexity.
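On the same sketch, the replication group can be chosen by continuing the clockwise walk past the owning server; replicaCount = 2 gives the paired group described above (again just an illustration, not the project's actual code):

List<String> replicasFor(String key, int replicaCount) {
    // start at the owning server and keep walking clockwise, wrapping around
    // the clock, until enough distinct servers have been collected
    def group = new LinkedHashSet<String>()
    def candidates = ring.tailMap(hash(key)).values() + ring.values()
    for (server in candidates) {
        group << server
        if (group.size() == replicaCount) break
    }
    group as List
}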

Open Source Alternative -

The open source, Consistent Hash based distributed cache Memcached is built on a similar design but without the high-availability replication capability. It expects the application code to be able to recover and rebuild the cache when a node becomes unavailable, which is usually an acceptable alternative to replication-based availability, at the cost of a performance penalty during an outage and increased code complexity. I usually recommend Memcached over a custom in-house distributed cache system, since it's proven, free, and a lot less work; what more can you ask for :-)

Further Improvement -

Although it is currently not in the plan, down the road there might be a need to implement Vector Clock based object versioning and merging to support simultaneous writes on multiple nodes, which is crucial for maintaining data consistency during a partial system failure. Currently we are simply planning to employ a "last write wins" strategy to resolve conflicts.
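For illustration only, "last write wins" can be as simple as comparing a timestamp attached to each stored version (the VersionedValue class below is hypothetical):

class VersionedValue {
    byte[] data
    long timestamp  // wall-clock time recorded by the node that accepted the write
}

VersionedValue resolve(VersionedValue a, VersionedValue b) {
    // keep whichever copy carries the later write; ties go to the first argument
    a.timestamp >= b.timestamp ? a : b
}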

Related Readings -

Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web

Amazon Dynamo

Tim's Blog on Consistent Hash

Thursday, September 04, 2008

Enable USB in VMWare under Ubuntu 8.04 - Hardy Heron

Just figured out how to enable USB controllers in a VMWare Windows instance under Ubuntu 8.04 - Hardy Heron; thought it might be useful to share it here on my blog.

If you have just upgraded to Hardy Heron, you will notice that USB devices no longer work in your VMWare instances. This is because the Ubuntu development team removed the /proc/bus/usb mount, which is what VMWare depends on for detecting USB devices. Re-enabling it is actually quite simple: just edit the device mount script with "sudo vim /etc/init.d/mountdevsubfs.sh" and uncomment the following part:

#
# Magic to make /proc/bus/usb work
#
mkdir -p /dev/bus/usb/.usbfs
domount usbfs "" /dev/bus/usb/.usbfs -obusmode=0700,devmode=0600,listmode=0644
ln -s .usbfs/devices /dev/bus/usb/devices
mount --rbind /dev/bus/usb /proc/bus/usb

Then restart the mount with "sudo /etc/init.d/mountdevsubfs.sh start", and you should be able to attach USB devices to your VMWare instances again. Have fun!