Jan 2011

Rapid Prototyping & Development of Infrastructure

According to Wikipedia, software prototyping is:

Software prototyping,
refers to the activity of creating prototypes of software applications, i.e., incomplete versions of the software program being developed. It is an activity that occurs during certain software development and is comparable to prototyping as known from other fields, such as mechanical engineering or manufacturing.

link

While rapid application development is:

Rapid Application Development (RAD)
refers to a type of software development methodology that uses minimal planning in favor of rapid prototyping. The "planning" of software developed using RAD is interleaved with writing the software itself. The lack of extensive pre-planning generally allows software to be written much faster, and makes it easier to change requirements.

link

Can these same methods be applied to infrastructure, or does infrastructure always have to be engineered? The real answer, as usual in my essays, is that it depends. Instead of conjecturing about when it might work, this text looks at three examples: one where it did not work, one where it kind of worked until it went off the rails, and one where it worked like a champ.

Didn't Work at All: Perceus for Cluster Management

In this particular scenario I needed to learn the basics of the Perceus cluster management software, but I did not have the hardware in the datacenter yet. What I did have was a laptop that was practically a supercomputer. Using VirtualBox, I created a 2 node cluster on a private network with a head node. Everything appeared to work well, but because I was not able to prototype the same networking and certain drivers, it ended up being a rather pointless endeavor. Essentially, the virtualized environment was just too different from the hardware environment. From one perspective some goals were realized: I had learned how to configure some of the management software, and I had even figured out how to create netboot images, which paid off. However, the return on time was not worth it. It took me about 1 hour to figure out how to set up the cluster and create a netboot image. It took about 3 hours to create the virtual machines, install the head node, download and install software, and so on. Essentially I lost 2 hours.

Sort of Worked: SGE On a Cluster

Using the aforementioned virtual environment, I figured out how to configure Sun Grid Engine. I ran a few jobs, modified some queues and thought *this is good*. When the hardware was in place, I took it one step further. Since I still needed to set up all of the hardware and other software, I decided to get two compute nodes functional and then give access to a few select users, on the condition that they understood this thing could blow up at any time and that we were still testing. This turned out to be a good idea. The users got me a list of software they needed installed, SGE configuration to-dos and some environment settings. Even though I had to rebuild the head node due to a hardware issue, I now had a list of software to build to get the users up and running when it was ready for production. I called that a win.

Worked Well: VMware

Although it might seem obvious at first why VMware would be good for organically testing something new, it is important to keep in mind that I am referring to VMware itself, not the guests. The approach I took to building a new VMware infrastructure was to proceed by incremental levels of risk. Eventually the infrastructure was to be designed like so:

  • 2 identical clusters with an InfiniBand back end and 2x1GB front ESX hosts.
  • 3 4x8 core ESX hosts on the front of each cluster with HA enabled
  • 1 single vCenter server backed up to tape
  • 2 storage nodes in failover over InfiniBand

Unfortunately, I did not have the InfiniBand gear or the storage nodes yet, but I wanted to get rolling. What I decided to do was implement the systems in steps as I waited for parts and pieces to come in. The steps I took were:

  1. Built vCenter.
  2. Installed ESXi on the 6 hosts.
  3. Created one dev/test cluster with 3 hosts.
  4. Told developers to have at it, with the caveat that it was not in production yet.
  5. Created the production cluster.
  6. Set up temporary NFS storage for both clusters on the same 1GB switch.
  7. Ported a monitoring system to a VM for production testing.
  8. Migrated everything to the temporary storage.
  9. Added the temporary NFS storage server to the backup scheme.
  10. Informed developers that their cluster was essentially good to go and educated them on the restore time (roughly 1 working day).
  11. Put low-hanging-fruit guests into production. These were guests that could be down for a day or more.
  12. Storage arrived. Installed the OS and configured it.
  13. InfiniBand switch and cabling arrived. Installed them.
  14. Migrated to InfiniBand-connected storage.
  15. Informed the user community of higher availability.
  16. Done.

From a certain point of view, we essentially started testing the automobile before it was fully road ready. We prototyped as much as we safely could and even deployed certain systems into production on VMware, as long as they could tolerate being down. Applying the rapid prototyping method worked perfectly in this scenario.
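The rule behind step 11 can be written down as a small check: a guest only goes onto the interim storage if it can tolerate being down for as long as a restore takes. The Python below is only a sketch of that idea, not something I actually ran. The Guest class, the guest names and the 8 hour figure for "one working day" are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumption: "roughly 1 working day" of restore time is treated as 8 hours.
RESTORE_TIME_HOURS = 8.0


@dataclass
class Guest:
    name: str                        # hypothetical guest name
    tolerated_downtime_hours: float  # how long the service may be down


def ready_for_interim_production(guest, restore_time_hours=RESTORE_TIME_HOURS):
    """Return True if the guest can survive a full restore from backup."""
    return guest.tolerated_downtime_hours >= restore_time_hours


def split_guests(guests, restore_time_hours=RESTORE_TIME_HOURS):
    """Split guest names into (deploy_now, wait_for_final_storage)."""
    deploy_now, wait = [], []
    for guest in guests:
        if ready_for_interim_production(guest, restore_time_hours):
            deploy_now.append(guest.name)
        else:
            wait.append(guest.name)
    return deploy_now, wait
```

Anything that lands in the second list waits until the permanent storage is in place, which is the whole point of raising risk in increments.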

Summary

What this text has covered has probably been done before. Most likely, however, many sysadmins don't necessarily look at the process in this way. Some people say that administrative/system programming is a completely different world from mainstream applications programming. While in some ways it is, it is clear that even building out certain types of infrastructure bears some commonalities to programming.