Thursday, October 28, 2004

What is Clustering?

Recent years have sparked a lot of interest in clustering technology. In one form or another, clusters will be a seamless part of our everyday lives, and they are vital to the future of computing.

Clustering is not a new methodology. Clusters have been around for decades! There are myriad reasons why one would “cluster” computers. Some do it because they demand computing power that “ordinary” computers cannot provide, others because their setup demands redundancy at multiple levels. Clusters are aptly named because you put individual computers together to serve a single purpose, and modern clusters are typically designed so that the end user “sees” just one machine.

We can generally define two distinct methodologies of clustering: Highly Available Computing (HA) and High Performance Computing (HPC). Each has its own distinct purpose.

Let's talk HA. Highly Available Computing is so named because these systems are deployed in mission-critical setups. It simply means that at no time must the service be denied to a user. This is done by data centers, financial institutions, media groups, email providers, governments, the military, DNS servers, etc. HA systems have redundancy built into them. That's the whole purpose. The “machine” typically has multiple hard drives, multiple power supplies, and even entire duplicate systems, so that should any one item or “point of failure” go down, another machine or device takes its place. Yet at the same time, data must remain secure and available, and the changeover must be seamless to the end user!
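To make the failover idea concrete, here is a minimal sketch in Python of the kind of health-check monitor an HA setup relies on. The host names are entirely hypothetical, and real HA stacks use virtual IPs, dedicated heartbeat links, and fencing rather than a simple TCP probe like this:

    import socket
    import time

    # Hypothetical addresses; real deployments would use a virtual IP
    # that floats between nodes, not hard-coded host names.
    PRIMARY = ("primary.example.com", 80)
    BACKUP = ("backup.example.com", 80)
    CHECK_INTERVAL = 5   # seconds between health checks
    MAX_FAILURES = 3     # consecutive failures before failing over

    def is_alive(host, port, timeout=2.0):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def monitor():
        active = PRIMARY
        failures = 0
        while True:
            if is_alive(*active):
                failures = 0
            else:
                failures += 1
                print(f"health check failed ({failures}/{MAX_FAILURES})")
                if failures >= MAX_FAILURES and active == PRIMARY:
                    # Promote the backup so service continues; the end
                    # user should never notice the switch.
                    active = BACKUP
                    failures = 0
                    print("failing over to backup node")
            time.sleep(CHECK_INTERVAL)

    if __name__ == "__main__":
        monitor()

The whole game of HA is in those few lines: something must notice that the primary is dead and put the backup in its place before the user notices anything at all.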

Another methodology is called High Performance Computing (HPC). True, HPC is not limited to clustering technology and can also mean massively parallel processors, but for our discussion we will limit ourselves to HPC as it pertains to clusters.

As its name implies, High Performance Computing's focus is crunching numbers. That's its primary purpose. These machines are typically used in laboratory settings; their purpose in life is to crunch numbers, decode DNA, predict the weather, track satellites, and search for that obscure baby name site on the Internet. A pretty good example of HPC is the setup deployed by Google, which it uses to search information quickly.

Donald Becker and Thomas Sterling of NASA had an idea back in the 1990s: why not build a High Performance Computer out of commodity components? PCs and PC parts had fallen to price levels so cheap that coupling PCs together to build a supercomputer was possible. Enter the age of commodity off-the-shelf High Performance Computing, which was called “Beowulf”.

Beowulfs are highly scalable, meaning you can just keep adding “nodes” to your machine to increase performance. Beowulf runs on Linux, and its future is bright as a niche in the ever-growing high performance computing world. Today, Beowulf is an accepted genre in High Performance Computing (for more information on Beowulfs, see the Beowulf Project link in the references below).
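To see what “just add nodes” buys you, here is a minimal sketch of the message-passing style of program Beowulfs typically run. It assumes mpi4py and an MPI runtime are installed on every node, and the computation itself is just a toy:

    # Run with something like: mpiexec -n 4 python sum_squares.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # this node's id, from 0 to size-1
    size = comm.Get_size()   # number of nodes in the job

    N = 1_000_000
    # Each node sums the squares of its own slice of 0..N-1.
    partial = sum(i * i for i in range(rank, N, size))

    # Node 0 collects and adds up the partial sums.
    total = comm.reduce(partial, op=MPI.SUM, root=0)
    if rank == 0:
        print(f"sum of squares below {N}: {total}")

Raising the node count in mpiexec -n shrinks each node's slice of the work, and that is the whole Beowulf proposition: more cheap boxes, more performance.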

Microsoft did a similar project for Windows way back when, called Wolfpack. Essentially, the purpose was to deploy HPC/HA using Windows. However, most people (well, the sane ones) would prefer to deploy Unix and/or its variants, e.g. Linux, BSD, etc., on an HA/HPC setup. It's much easier with Unix than with Windows, in our humble opinion, but that's just us.

Anyway, there are myriad ways to skin a cat, as they say. Present-day network technologies allow off-the-shelf deployment of cluster technology, and it's not just Beowulf. Apple Macs using Apple's Rendezvous networking software, Apple AirPort (WiFi), Mac OS X, and the Xgrid software can get you started on a “two” node cluster that can scale appropriately. The Linux Virtual Server Project (www.linuxvirtualserver.org) can help you build an affordable Highly Available cluster. And then there is something called a Grid (you can also visit Globus), which is the coupling of machines not only at the hardware/network level but at the application layer as well, to form interoperable, highly available, and high performance systems.
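The Linux Virtual Server does its balancing inside the kernel, but the idea behind its simplest scheduler fits in a few lines of Python. The server names below are made up, and this toy dispatcher only illustrates round-robin scheduling; it is not how LVS is actually implemented:

    from itertools import cycle

    # Hypothetical pool of real servers behind one virtual address.
    REAL_SERVERS = ["node1.example.com", "node2.example.com", "node3.example.com"]

    # cycle() hands out servers round-robin, the simplest of the
    # scheduling algorithms LVS offers (it also has weighted and
    # least-connection variants).
    scheduler = cycle(REAL_SERVERS)

    def dispatch(request_id):
        """Pick the next real server for an incoming request."""
        server = next(scheduler)
        print(f"request {request_id} -> {server}")
        return server

    for i in range(7):
        dispatch(i)

To the client there is still just one address and one “machine”, which is exactly the clustering illusion described above.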

Isn't it all exciting and stimulating?

References and Further reading:

IBM DeveloperWorks - bleeding-edge stuff, lots of tutorials on these "edge" technologies.


For Linux:
The Linux Documentation Project - everything you need to know about Linux.

Linux Virtual Server Project - a good primer on clusters and clustering methodologies.

Linux ISO Images - a site where you can download an ISO image to burn so that you can run Linux now.

SUSE Linux - one of the best commercial distributions of Linux, owned by Novell.

The Debian Project - the best Linux distribution on the planet today, in my humble opinion, because of "apt". Just don't get involved with the politics!

Knoppix Project - Linux on a LiveCD! It means you can run the operating system by simply booting from the CD or DVD. :D Great, no? Especially when you don't have a hard drive or a spare machine to play with Linux on!


The Beowulf:

Beowulf Project - the number one resource on the Internet for Beowulf.

Information about Grids:

Grid

Globus Project - this is driving Grid technology today and into the future!

Apple Mac:
Apple Xgrid - a really nice piece of technology; makes life easier.

Microsoft:
Microsoft Wolfpack - an article on Wolfpack.

Eclipse Project:
Eclipse Project - one of the best development platforms out there. Could be useful when you start developing Cluster/Grid apps. A very scalable piece of software; it was developed by Big Blue.
