Running your own hardware Vs EC2 and RightScale — Part 2
September 16, 2008
This week I’ve been reminded of a very important lesson… No matter how abstracted you are from your hardware, you still inherently rely on its smooth and consistent operation.
This past week CitySquares‘ NFS server went down for the count and was completely unresponsive to any type of communication. In fact, the EC2 instance was so FUBAR we couldn’t even terminate it from our RightScale dashboard. A post on Amazon’s EC2 board was required to terminate it. Turns out the actual hardware our instance was running on had a catastrophic failure of some sort. Otherwise, at least so I’m told, server images are usually migrated off of machines running in a degraded state automatically.
Needless to say, the very reasons for deciding against running our own hardware have come back to plague us. Granted we weren’t responsible for replacing the hardware but we were still affected by the troublesome machine. We weren’t just slightly affected by the loss of our NFS server either. Since we are running off of a heavily modified Drupal CMS our web servers depend on having a writable files directory. As it turned out Apache just spun waiting for a response from the file system, our web services ground to a halt waiting on a machine that was never going to respond… ever. Talk about a single point of failure! A non critical component, serving mainly images and photos managed to take down our entire production deployment.
This event has prompted us to move forward with a rewrite of Drupal’s core file handling functionality. The rewrite will include automatically directing file uploads to a separate domain name like csimg.com or something similar. Yahoo goes into more detail with their performance best practices. However, editing the Drupal core is generally frowned upon and heavily discouraged since it usually conflicts with the upgrade path and maintainability of the Drupal core becomes much more difficult. While we haven’t stayed out of the Drupal core entirely, the changes we have made are minor and only for performance improvements. I believe it is possible to stay out of the core file handling by hooking into it with the nodeapi but it seems like more trouble than its worth.
The idea behind the file handling rewrite is to serve our images and photos directly from our Co-Location while keeping a local files directory on each EC2 instance for non user committed things like CSS and JS aggregation caching among other simple cache related items coming from the Drupal core. This rewrite will allow us to run one less EC2 instance, saving us some money as well as remove our dependence on a catastrophic single point of failure.
For the time being we have set up another NFS server. This time based on Amazon’s new EBS product. I spoke about this in a previous post. One of the issues we had when the last NFS server went down was the loss of user generated content. Once the instance went down all the storage associated with that instance went down with it. There was no way to recover from the loss, it was just gone. This is just one of the many possible problems you can run into with the cloud. While on the pro side, you don’t have to worry about owning your own hardware, the con side is you cant recover from failures like you can with your own hardware. This is a very distinct difference and should be seriously considered before dumping your current architecture for the cloud.