99designs hosts design contests, and has been growing rapidly year on year. This post gives an overview of the infrastructure which powers our site, how we make sure it continues to hum smoothly, and the challenges we face as we scale.
A little context
Since its humble origins in ad-hoc contests within Sitepoint’s forums, 99designs has turned out to be one of the success stories of the Melbourne start-up scene. Although we now have offices in both Melbourne and San Francisco, the development team is in the Melbourne office, close to where the action all began. Our team here has 8 devs, 2 dev ops, 2 ux/designers, and is expanding. Half the people here arrived within the past year, myself included, amounting to a lot of growth and change over a limited amount of time.
Our site sees hundreds of thousands of unique visitors a month, generating pageviews in the tens of millions. Since we deal with graphic design, many of our pages are asset heavy — these pageviews fan out to some 40 times as many requests. Whilst there are many larger sites on the net, we thought this was enough to warrant sharing the way we do things.
Requests in layers
The easiest way to describe how we serve requests is to talk about it in layers, each of which solves a set of problems we face. We’ll cover six pieces of this puzzle: load balancing, acceleration, application, asynchronous tasks, storage and transient data.
Let’s start at the beginning of a request. When a user visits 99designs.com, the request first hits our Elastic Load Balancer (ELB). The load balancer is a highly reliable service which ensures that requests are spread evenly between the Varnish servers beneath it. It also performs active health checks, so that requests only hit healthy servers, and SSL unwrapping, allowing us to work with an unencrypted stack from there on down. On the SSL front, using a separate ELB for each domain turns out to be a convenient way of running multiple secure domains.
Our acceleration layer consists of several Varnish servers, which allow us to serve a large amount of media with only a limited app stack beneath them. We have a long tail of static media, so we run Varnish with a file-based rather than in-memory storage backend. Varnish is fast, and incredibly configurable through its inbuilt DSL. Furthermore, its command-line tools for inspecting live traffic are second to none, and are incredibly useful in tracking down odd site behaviour.
We strive to have a responsive site where users aren’t kept waiting, but requests often need to do some extended work, or access an external API. Integrated with our application layer is an asynchronous layer which tackles this problem. We queue up tasks using Beanstalk in-memory task queues on each of our app servers, using the Pheanstalk bindings. Beanstalk is known to be lightweight and performant, with the trade-off that we lose some visibility into the immediate contents of our queues. A pool of PHP workers listens to these queues, and takes care of anything lengthy or requiring access to an external API. Tasks which need to run at a particular time are stored instead in the database, and added to the queues by cron when they fall due.
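The hand-off from database to queue for scheduled tasks can be sketched roughly as follows. This is an illustration only, not our production code: it uses Python with an in-memory SQLite table and a `deque` standing in for MySQL and Beanstalk, and the table name, column names and function names are all assumptions made for the example.

```python
import sqlite3
from collections import deque


def create_schema(db):
    # Hypothetical table holding tasks that must run at a particular time.
    db.execute(
        "CREATE TABLE scheduled_tasks ("
        "id INTEGER PRIMARY KEY, "
        "payload TEXT NOT NULL, "
        "due_at INTEGER NOT NULL, "
        "queued INTEGER NOT NULL DEFAULT 0)"
    )


def schedule(db, payload, due_at):
    # Store the task in the database rather than putting it on the queue.
    db.execute(
        "INSERT INTO scheduled_tasks (payload, due_at) VALUES (?, ?)",
        (payload, due_at),
    )


def enqueue_due_tasks(db, queue, now):
    """What a cron job might run: move tasks that have fallen due onto the queue."""
    rows = db.execute(
        "SELECT id, payload FROM scheduled_tasks "
        "WHERE queued = 0 AND due_at <= ? ORDER BY due_at, id",
        (now,),
    ).fetchall()
    for task_id, payload in rows:
        queue.append(payload)
        # Mark the row so the next cron run does not enqueue it again.
        db.execute("UPDATE scheduled_tasks SET queued = 1 WHERE id = ?", (task_id,))
    db.commit()
    return len(rows)


def run_worker(queue, handler):
    """Drain the queue, passing each task to the handler in FIFO order."""
    processed = 0
    while queue:
        handler(queue.popleft())
        processed += 1
    return processed
```

Marking rows as queued keeps repeated cron runs from enqueueing the same task twice. A real worker would block waiting for new jobs rather than stopping when the queue is empty.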
Our storage layer features Amazon’s managed MySQL service (RDS) as the primary, authoritative and persistent store for our crucial data. An RDS instance configured to use multiple availability zones provides master-master replication, giving our DB layer crucial redundancy. This feature has already saved our bacon multiple times: the failover has been smooth enough that by the time we realised anything was wrong, another master was correctly serving requests. Its rolling backups provide a means of disaster recovery. We load-balance reads across multiple slaves as a means of maintaining performance as the load on our database increases. For media files and data blobs, we use S3 for redundant and highly-available storage, with periodic backups to Rackspace Cloudfiles for disaster recovery.
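As a rough illustration of load-balancing reads, here is a minimal Python sketch of a router that sends `SELECT` statements to replicas in round-robin order and everything else to the primary. The class name, the statement-sniffing rule and the host labels are assumptions for the example, not a description of how our application actually routes queries.

```python
import itertools


class ReadWriteRouter:
    """Pick a database target per statement: replicas for reads, primary for writes."""

    def __init__(self, primary, replicas):
        if not replicas:
            raise ValueError("at least one replica is required")
        self.primary = primary
        # cycle() gives simple round-robin over the replica list.
        self._replicas = itertools.cycle(replicas)

    def connection_for(self, sql):
        words = sql.split()
        verb = words[0].upper() if words else ""
        if verb == "SELECT":
            return next(self._replicas)
        # Writes, and anything we cannot classify, go to the primary.
        return self.primary
```

One caveat with any scheme like this: if replication is asynchronous, a read issued straight after a write may hit a replica that has not caught up yet, so reads which must see the latest data are usually pinned to the primary.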
Aside from our database proper, there are three services which we use primarily for transient data: Memcached, MongoDB and Redis. Memcached runs locally on every application server, with a peering arrangement between servers, and helps us reduce our database queries dramatically. We log errors and statistics to capped collections in MongoDB, providing us with more insight into our system’s performance. Redis captures per-user information about which features are enabled at any given time; it supports our development strategy around dark launches, soft launches and incremental feature rollouts.
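The usual way a cache cuts database queries is the cache-aside pattern: look in the cache first, and only run the query on a miss. The sketch below shows the idea in Python, with a tiny in-process `TTLCache` standing in for a Memcached client. Both `TTLCache` and `cached_fetch` are hypothetical names for this example, and the key format is an assumption.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a Memcached client (get/set with expiry)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired entries behave like misses.
            del self._data[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)


def cached_fetch(cache, key, load, ttl=60):
    """Return the cached value for key, calling load() only on a miss."""
    value = cache.get(key)
    if value is None:
        value = load()  # e.g. the database query we want to avoid repeating
        cache.set(key, value, ttl)
    return value
```

A call such as `cached_fetch(cache, "contest:7", load_contest)` would then hit the database at most once per TTL window for that key. Note that this simple version cannot cache a legitimate `None` result.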
Software as infrastructure
99designs strongly follows the “software as infrastructure” mantra. Like many companies now, we don’t own any hardware ourselves, preferring to remain flexible, and relying heavily on Amazon’s cloud offering. Growing as we have has meant a lot of change in a limited period of time, and has built into our culture a distrust for documentation and the dual-maintenance problem it creates. Instead, we focus on automation of as much as possible.
We currently use Rightscale to manage our server configurations, which basically amounts to using a managed form of Chef for provisioning new servers. We make sure each server type has a recipe which allows us to spin up replacements at a moment’s notice. This means we can treat servers as disposable, and mean it.
The layers and services which make up our infrastructure amount to a fair number of moving parts, so monitoring and keeping track of the distributed application state is important. We do that through a number of services, including a large number of custom monitoring pages, NewRelic, CloudWatch, Statsd and others. Two large monitoring screens feature prominently in our office, making sure we’re aware of changes to the site. Despite all this information, we’re continuously working to get a better understanding of site behaviour and performance.
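To give a feel for how lightweight Statsd instrumentation is, here is a minimal Python sketch that builds metrics in the plain-text Statsd line format (`name:value|type`, with an optional `|@rate` for sampling) and sends them over UDP. The metric names, host and function names are made up for the example; 8125 is the conventional Statsd port.

```python
import socket


def format_metric(name, value, kind, sample_rate=1.0):
    """Build one Statsd line: 'c' for counters, 'ms' for timers, 'g' for gauges."""
    packet = f"{name}:{value}|{kind}"
    if sample_rate < 1.0:
        # Tell the server this metric was only sent for a fraction of events.
        packet += f"|@{sample_rate}"
    return packet.encode("ascii")


def send_metric(packet, host="127.0.0.1", port=8125):
    """Fire-and-forget over UDP, so a slow metrics server never blocks a request."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))
```

For instance, `send_metric(format_metric("web.requests", 1, "c"))` would count one request, and `format_metric("web.render_time", 320, "ms")` would record a 320 ms timing.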
Whilst the team here has some pride in our accomplishments, there’s a lot we still have to work on. Here are some of our biggest challenges:
- Scaling back infrastructure, rather than just scaling out. As our site changes and our customer base grows, the load we place on our backend systems can vary dramatically. One way to deal with this is to over-provision, so as to meet such spikes without issue. A challenge for us is to automate and stress-test even more of our infrastructure, so that we can bring up new servers even faster and more reliably. This would allow us to confidently reduce capacity when we have excess, rather than simply expanding.
- Providing a strong experience for international customers. We have a diverse and international customer-base, yet all the action is currently served out of Amazon’s US-East data center. This leads to quite a disparity in customer experience. We’re currently trialling CDNs in order to get static media to our international customers faster, and likewise looking at other ways we can improve performance.
- Balancing feature growth with stability. Being responsive to our customers means being able to push out new features quickly. In some companies, this causes a tension between developers who need to get code out, and ops who are woken in the night by the consequences of a hasty change. We’re attacking this problem from multiple angles: stronger acceptance testing should give us better sanity checks on new code that goes out; feature flipping allows us to incrementally roll out new features to only a subset of users; and finally, we’re working on further automation in order to allow our developers to be more active in our infrastructure, meaning they can really own a change from the moment it’s coded to the moment users see it in production.
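Feature flipping for a subset of users can be sketched in a few lines. The Python example below is one common approach, not necessarily ours: it hashes the feature name and user ID into a stable bucket from 0 to 99 and enables the feature for buckets below the rollout percentage. The function names are hypothetical, and the `overrides` dictionary stands in for whatever per-user store holds explicit on/off settings.

```python
import hashlib


def rollout_bucket(feature, user_id):
    """Map a (feature, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(feature, user_id, percent, overrides=None):
    """Decide whether a user sees a feature during an incremental rollout."""
    # Explicit per-user settings win, e.g. staff testing a dark launch.
    if overrides and user_id in overrides:
        return overrides[user_id]
    return rollout_bucket(feature, user_id) < percent
```

Because the bucket depends only on the feature name and user ID, a user who sees the feature at 10% keeps seeing it as the rollout widens to 50% and beyond, and different features get independent subsets of users.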
Watch this space
This post has given an overview of our current stack, and some of the broad challenges we face, but we’ve got a lot more to say about our development style, and the things which make our culture. We’ve benefited greatly from the open source community and the expertise of those who share their experiences. Now that we’ve grown, we’re keen to give back a little too, in the hope that others can benefit from what we’ve learned.