Like any platform you choose, EC2 has its own limitations. They are often different from, and harder to overcome than, the ones you run into while running your own hardware, and without proper planning and development they can wind up being extremely detrimental to the well-being and scalability of your website or service.

There are quite a few blogs, articles, and reviews out there that mention all the positive aspects of EC2, and I have written a few of them myself. However, I think users need to be informed of the negative aspects of a platform as well as the positive. I will keep this post brief, as my next will focus on designing an architecture around these limitations.

The biggest limitations of Amazon’s EC2 at the moment, as I have experienced them, are the latency between instances, the latency between instances and storage (both local and EBS), and the lack of powerful instances with more than 15GB of RAM and 4 virtual CPUs.

All the latency issues can be traced back to the same root cause: a shared LAN with thousands of non-localized instances all competing for bandwidth. Normally one would think a LAN would be quick… and they generally are, especially when the servers are sitting right next to each other with a single switch between them. However, Amazon’s network is much more extensive than most local LANs, and chances are your packets are hitting multiple switches and routers on their way from one instance to another. Every extra node between instances adds another few milliseconds to the packet’s round trip. You can think of Amazon’s LAN as a really small Internet: like the Internet, there is no cohesiveness or localization of instances in relation to one another, so lots of data has to travel from one end of the LAN to the other. Data ends up traveling much farther than it needs to, and all the congestion problems found on the Internet can be found on Amazon’s LAN as well.
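For the curious, here is the kind of quick-and-dirty check I am talking about: a little PHP script that just times TCP connects from one instance to another. The hostname and port are placeholders for one of your own instances, and a real tool like ping or mtr will give you better numbers, but it makes the point.

<?php
// Rough TCP connect-time probe between two EC2 instances.
// "internal-peer.example" and port 3306 are placeholders for another
// instance's internal hostname and an open port on that instance.
$host = 'internal-peer.example';
$port = 3306;
$samples = 20;
$times = array();

for ($i = 0; $i < $samples; $i++) {
    $start = microtime(true);
    $sock = @fsockopen($host, $port, $errno, $errstr, 2);
    $elapsed = (microtime(true) - $start) * 1000; // milliseconds
    if ($sock === false) {
        echo "connect failed: $errstr\n";
        continue;
    }
    fclose($sock);
    $times[] = $elapsed;
}

if (count($times) > 0) {
    printf("min %.2f ms / avg %.2f ms / max %.2f ms over %d connects\n",
        min($times), array_sum($times) / count($times), max($times), count($times));
}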

For computationally intensive tasks this really isn’t too big a deal, but for those who rely on speedy database calls, every millisecond added per query really starts to add up when a page issues lots of them. When the CitySquares site moved from our own local servers to EC2, we noticed a 4-10x increase in query times, which we attribute mainly to the high latency of the LAN. Since our servers are no longer within feet of each other, we have to contend with longer distances between instances and congestion on the LAN.
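To make the “milliseconds add up” point concrete, here is a rough sketch of the sort of measurement I mean. The host and credentials are made up, and the query does no real work, so nearly all of the measured time is network round trip plus MySQL overhead. On a heavy Drupal page that issues a hundred queries, an extra 2 ms per query is an extra 200 ms of page time before any real work happens.

<?php
// Time a trivial query repeatedly to see the per-round-trip overhead.
// Host and credentials are placeholders for your own database instance.
$db = new mysqli('db.internal.example', 'user', 'pass', 'test');
if ($db->connect_errno) {
    die('connect failed: ' . $db->connect_error . "\n");
}

$iterations = 200;
$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $result = $db->query('SELECT 1'); // no real work, just the round trip
    $result->free();
}
$perQuery = ((microtime(true) - $start) / $iterations) * 1000;

printf("%.3f ms per round trip\n", $perQuery);
$db->close();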

Another thing to take into consideration is the network latency of Amazon’s EBS. For applications that move around a lot of data, EBS is probably a godsend, as it offers plenty of bandwidth. In CitySquares’ case, however, we wind up doing a lot of small file transfers to and from our NFS server as well as our EBS volumes. So while there is a lot of bandwidth available to us, we can’t really take advantage of it, since we have to contend with the latency and overhead of transferring many small files. Small files aren’t our only issue: we also run our MySQL database off of an EBS volume. Swapping to disk has always been a critical issue for databases, but the added overhead of network traffic can wreak havoc on your database load far more than normal disk swapping. You can think of the difference in access times between a local disk and a disk over the network as a book on your bookcase versus a book somewhere down the hall in storage room B. Clearly the second option takes far longer, and that’s what you have to work with if you want the peace of mind of persistent storage.
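If you want to see the small-file overhead for yourself, a comparison along these lines makes it obvious. The paths are placeholders for directories on an NFS- or EBS-backed mount; reading the same number of bytes as thousands of tiny files tends to take far longer than reading them as one big file, because every open and stat is another round trip.

<?php
// Compare reading many small files against one file of similar total size.
// Both paths are placeholders for locations on a network-backed mount.
$smallDir  = '/mnt/nfs/thumbnails';   // e.g. thousands of few-KB images
$largeFile = '/mnt/nfs/archive.tar';  // one file of comparable total size

$files = glob($smallDir . '/*');
$start = microtime(true);
$bytes = 0;
foreach ((array) $files as $file) {
    if (is_file($file)) {
        $bytes += strlen(file_get_contents($file));
    }
}
printf("small files: %d bytes in %.2f s\n", $bytes, microtime(true) - $start);

$start = microtime(true);
$bytes = strlen(file_get_contents($largeFile));
printf("large file:  %d bytes in %.2f s\n", $bytes, microtime(true) - $start);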

The last and most important limitation for us at CitySquares is the lack of an all-powerful machine. The largest instance Amazon has to offer comes with just 15GB of RAM and 4 virtual CPUs. In a day and age when you can easily find machines with 64GB of RAM and 16 CPUs, you are definitely limited by Amazon. In our case it would be much easier just to throw hardware at our database to scale up, but the only thing at our disposal is a paltry 15GB of RAM. How can this be the biggest machine they offer? Instead of dividing one of those machines into quarters, just give me the whole thing. It seems ludicrous to me that the largest machine they offer is not much more powerful than the computer I’m using right now.

Long story short, just because you start using AWS doesn’t mean you can scale. Make sure your architecture is tolerant of higher latencies and can scale out across lots of little machines, because that’s all you have to work with.

This week I’ve been reminded of a very important lesson… No matter how abstracted you are from your hardware, you still inherently rely on its smooth and consistent operation.

This past week CitySquares’ NFS server went down for the count and was completely unresponsive to any type of communication. In fact, the EC2 instance was so FUBAR we couldn’t even terminate it from our RightScale dashboard. A post on Amazon’s EC2 board was required to terminate it. Turns out the actual hardware our instance was running on had a catastrophic failure of some sort. Otherwise, at least so I’m told, server images are usually migrated off of machines running in a degraded state automatically.

Needless to say, the very reasons we decided against running our own hardware have come back to plague us. Granted, we weren’t responsible for replacing the hardware, but we were still affected by the troublesome machine. And we weren’t just slightly affected by the loss of our NFS server, either. Since we run a heavily modified Drupal CMS, our web servers depend on having a writable files directory. As it turned out, Apache just spun waiting for a response from the file system, and our web services ground to a halt waiting on a machine that was never going to respond… ever. Talk about a single point of failure! A non-critical component, serving mainly images and photos, managed to take down our entire production deployment.
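One thing we could have done better is detecting the problem faster: a hung NFS mount fails slow, not fast. Something along the lines of the watchdog below, run from cron, would at least have told us the mount was wedged instead of letting Apache pile up workers. The mount path is made up, and it needs PHP’s pcntl and posix extensions on the CLI, so treat it as a sketch of the idea rather than what we actually run.

<?php
// Cron-style watchdog: try to touch a file on the NFS mount in a child
// process and give up after a few seconds instead of blocking forever.
// The mount path is a placeholder; requires the pcntl and posix extensions.
$probe   = '/mnt/nfs/.healthcheck';
$timeout = 5; // seconds

$pid = pcntl_fork();
if ($pid === -1) {
    die("fork failed\n");
} elseif ($pid === 0) {
    // Child: this is the call that hangs when the NFS server is dead.
    exit(@touch($probe) ? 0 : 1);
}

// Parent: poll for the child, kill it if it takes too long.
$deadline = time() + $timeout;
while (time() < $deadline) {
    $res = pcntl_waitpid($pid, $status, WNOHANG);
    if ($res === $pid) {
        exit(pcntl_wexitstatus($status) === 0 ? 0 : 1); // 0 = mount healthy
    }
    usleep(200000);
}
posix_kill($pid, SIGKILL);
pcntl_waitpid($pid, $status);
echo "NFS mount unresponsive, alert someone / fail over\n";
exit(2);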

This event has prompted us to move forward with a rewrite of Drupal’s core file handling functionality. The rewrite will include automatically directing file uploads to a separate domain name, something like csimg.com. Yahoo goes into more detail on why in their performance best practices. Editing the Drupal core is generally frowned upon and heavily discouraged, since it complicates the upgrade path and makes the core much harder to maintain. While we haven’t stayed out of the Drupal core entirely, the changes we have made are minor and only for performance improvements. I believe it is possible to stay out of the core file handling by hooking into it with the nodeapi, but it seems like more trouble than it’s worth.

The idea behind the file handling rewrite is to serve our images and photos directly from our co-location while keeping a local files directory on each EC2 instance for non-user-committed content such as CSS and JS aggregation caches and the other simple cache files coming from the Drupal core. This rewrite will allow us to run one less EC2 instance, saving us some money, as well as remove our dependence on a catastrophic single point of failure.
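To give a rough idea of the direction, here is a sketch of the kind of helper we have in mind. The domain and directory names are made up for illustration, and the real change lives inside our Drupal file handling rather than a standalone function, but it shows the split between locally served cache files and user content served from the image domain.

<?php
// Sketch: user uploads get rewritten to a dedicated image domain, while
// generated assets (CSS/JS aggregation, other cache files) keep coming
// from each web head's local files directory. Names are hypothetical.
function citysquares_file_url($filepath, $img_domain = 'http://csimg.example.com') {
    // Generated assets are rebuilt on every instance, so serve them locally.
    $local_prefixes = array('files/css/', 'files/js/');
    foreach ($local_prefixes as $prefix) {
        if (strpos($filepath, $prefix) === 0) {
            return '/' . $filepath;
        }
    }
    // User-committed content (images, photos, attachments) comes from
    // the co-located image servers instead of an NFS mount.
    return rtrim($img_domain, '/') . '/' . ltrim($filepath, '/');
}

// Example:
echo citysquares_file_url('files/css/css_a1b2c3.css') . "\n"; // served locally
echo citysquares_file_url('files/photos/pizza.jpg') . "\n";   // image domain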

For the time being we have set up another NFS server, this time backed by Amazon’s new EBS product, which I spoke about in a previous post. One of the issues we had when the last NFS server went down was the loss of user-generated content. Once the instance went down, all the storage associated with it went down too. There was no way to recover from the loss; it was just gone. This is just one of the many possible problems you can run into with the cloud. On the pro side, you don’t have to worry about owning your own hardware; on the con side, you can’t recover from failures the way you can with your own hardware. This is a very distinct difference and should be seriously considered before dumping your current architecture for the cloud.

I have come to the conclusion that I should be cataloging my work, thoughts, theories, and activities so that others can read about and learn from my experiences as a web engineer. Let me begin by mentioning that I work at a company called CitySquares, and for the last year I have been working diligently on the current CitySquares site.

This has been a great year for me, as I was given the opportunity to learn the inner workings of the Drupal CMS. While Drupal is a great CMS/framework, it is inherently still a prepackaged CMS designed for the things that 99% of the community needs. CitySquares, unfortunately, falls within the other 1%. I must say that we have accomplished quite a bit using Drupal’s community modules in conjunction with our own custom-written ones. However, there are plans in the works that we would like to implement but just can’t within the Drupal framework.

All is not lost, though. With the current iteration running, stable, and gaining traffic every week, I have the opportunity to turn the page and begin work on the next phase of development. This is an exciting time, and I will use this medium to convey the successes as well as the issues as development here continues.

That said, we have decided to scrap our Drupal-based architecture in favor of a more extensible framework, Symfony. Symfony is a PHP-based OO framework that resembles Ruby on Rails. Not only will we gain the benefit of switching to an OO-style framework, but we will also be using Doctrine as our ORM and Smarty as our template engine.
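To give a flavor of what the switch buys us, here is roughly what a model and a query look like in Doctrine 1.x. The class and columns are hypothetical, and in practice Symfony generates most of this plumbing from a schema file, but it beats hand-writing SQL all over the place.

<?php
// Minimal Doctrine 1.x sketch; assumes Doctrine.php is on the include path
// and the connection DSN points at a real database. Names are made up.
require_once 'Doctrine.php';
spl_autoload_register(array('Doctrine', 'autoload'));
Doctrine_Manager::connection('mysql://user:pass@db.internal.example/citysquares');

class Business extends Doctrine_Record
{
    public function setTableDefinition()
    {
        $this->hasColumn('name', 'string', 255);
        $this->hasColumn('neighborhood', 'string', 64);
    }
}

// Fetch listings for a neighborhood without hand-writing SQL.
$listings = Doctrine_Query::create()
    ->from('Business b')
    ->where('b.neighborhood = ?', 'South End')
    ->execute();

foreach ($listings as $biz) {
    echo $biz->name . "\n";
}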

The idea is that this combination of technologies will help us alleviate two of the major problems we have with Drupal: scalability and codability. I’ve been toying with some ideas to help eliminate these two thorns in our side that I will discuss at a later time, so look forward to hearing my ideas on a full-stack horizontal architecture.