Startups building software products tend to think about scalability as an afterthought, quite understandably. There are far more pressing things to focus on first: identifying a product that solves a problem for a paying, addressable market, delivering value to that market, and then collecting revenue for doing it. All of that must take centre stage, at least initially, particularly if you’re following a lean approach to getting started and are launching an MVP to validate an idea. (Hint: too much thought about scalability probably violates the “M” for minimum.)
At an early point in the lifecycle of a product however, someone inevitably asks “Does it scale?” — Meaning, will the product continue to work if we have 10 times, or 100 times, or 1,000 times the current number of simultaneous users? And, more interestingly… at what point will it start to fail?
Not surprisingly, the initial answer is often “I don’t know”. And this is a concern.
So here, in a digestible nugget, is what I’ve learned about the huge topic of Software Scalability in 20 years as a software engineer, so far (I’m sure I have much more to learn). I’ve focussed mainly on server-side scalability of web applications, particularly those deployed on Amazon Web Services (AWS) or Heroku (or similar), as that’s my area of expertise and close to what most startups are looking to scale. But many of the ideas here also apply to other types of software product.
SO WHY EVEN THINK ABOUT SCALABILITY?
In two words: Customer Experience.
If your customers have a bad experience when using your product — a slow response, a timeout, an error message — they will, in all likelihood, look elsewhere. With poor scalability, you can literally lose customers, often before you’ve gained them.
The basic, and often overlooked, software engineering fact that drives the need to consider scalability is this: what works well in your development environment (usually on your laptop), and seems to do similarly well when initially deployed to AWS or Heroku with very little traffic (and by traffic, I mean users), may not work so well when you publish your first press release, get mentioned on Hacker News, or in the days after you give your first kick-ass presentation and everyone flocks to your site to find out more. These are all happy events… but only if your product scales to meet the increased demand, which naturally requires increased resources.
Being able to offer every visiting user a reasonably good experience of your product is as key to success as figuring out what your product should be. A great product that is slow to respond is unappealing at best. A product that is hidden behind a timeout message is essentially worthless.
WHEN TO SCALE?
Ideally, just before you need to.
Metrics – Just as with forecasting the weather, gathering and using metrics is the only real way to tell how your product is doing: whether it is handling the current load, and to predict where the limits and the degradation in user experience might begin to occur.
Instrumenting your product code (by capturing your own timing stats, perhaps using Aspect-Oriented Programming or a third-party library), measuring throughput and response times (hits/sec, average response time, etc.), and then capturing and aggregating those metrics using a product such as New Relic, or building your own performance dashboard, are all great ways to expose the current performance of your product in near real time.
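As a flavour of what “capturing your own timing stats” can look like, here is a minimal sketch in Python. The decorator name, the metric name and the `print` sink are all illustrative; in practice you’d send the timing to your logger, a StatsD client, or New Relic as a custom metric.

```python
import functools
import time


def timed(metric_name):
    """Record how long the wrapped function takes each time it runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                # Replace with your own sink: a logger, StatsD, New Relic, etc.
                print(f"timing {metric_name} {elapsed_ms:.1f}ms")
        return wrapper
    return decorator


@timed("checkout.create_order")  # hypothetical endpoint being instrumented
def create_order(cart):
    ...  # your real request-handling code
```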
Spare Capacity – Metrics only show how your product is currently doing. What you’re more interested in is the spare capacity: how much more traffic you can handle before problems start to occur. If you are comfortably handling 1,000 requests/minute right now with a decent average response time and customer experience, but you aren’t sure whether the 1,001st request will start to cause timeouts, then you effectively know nothing about your spare capacity.
This is where Performance Testing, in a separate but (crucially) production-like environment, lets you figure out by experimentation where the limits of your system lie. Regularly performance testing and benchmarking your product, graphing the results, and determining at what point a single server begins to give an unacceptably degraded user experience lets you forecast the capacity and breaking point of multiple servers handling the load between them. This, in turn, gives you a picture of the spare capacity you currently have. Even better, baking all of this into your product metrics and reporting a percentage of capacity remaining lets you know how close you currently are to needing to scale.
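Turning benchmark results into a “capacity remaining” figure can be as simple as the sketch below. The numbers, and the assumption that capacity scales roughly linearly with server count, are illustrative; your own benchmarks should drive the per-server limit.

```python
def capacity_remaining(current_rpm, max_rpm_per_server, server_count):
    """Estimate percentage of headroom left, from a benchmarked per-server limit.

    max_rpm_per_server: requests/minute at which a single, production-like
    server started to give an unacceptably degraded user experience.
    """
    total_capacity = max_rpm_per_server * server_count
    return max(0.0, (total_capacity - current_rpm) / total_capacity * 100)


# e.g. benchmarks showed degradation at 1,500 req/min per server; we run
# 2 servers and currently see 1,000 req/min:
print(capacity_remaining(1000, 1500, 2))  # roughly 66.7% headroom left
```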
It is worth saying that any such notion of capacity needs reviewing regularly, as changes to your product, the hardware it is deployed on, or other random factors can change capacity in unforeseen ways.
Automation – Metrics alone aren’t enough. It’s no use having great metrics if, at 3am when your product gets mentioned on Hacker News, you don’t notice the hits escalating, average response times climbing and the number of timeouts going through the roof. Automation and scaling policies allow you to define conditions under which you’d like to scale up, even when you aren’t around to make the decision yourself. Unless you plan on watching your metrics 24/7, or getting out of bed to answer the PagerDuty call, those metrics, and in turn their notion of dwindling spare capacity, must trigger some sort of scaling action.
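On AWS, one way to express such a policy is a target-tracking rule on an EC2 Auto Scaling group. The sketch below uses boto3; the group name, policy name and 60% CPU target are assumptions, and in practice you might track a custom metric derived from your own capacity figures rather than CPU.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical Auto Scaling group: target tracking keeps average CPU near 60%,
# adding instances when it rises above that and removing them when it falls.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-app-asg",
    PolicyName="keep-cpu-around-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,
    },
)
```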
Cycles – When to scale is a never-ending “when”, and it is often cyclical: As well as wanting to scale up to meet increasing demand, you might want to scale back down again in quiet times to avoid wasting resources. Any scaling policy should also define how much spare capacity is too much, if only so your AWS bill doesn’t hit the roof and wipe out your profit.
Metrics, coupled with automation, tell you when to scale, but not how.
HOW TO SCALE
Scaling up (often called Vertical Scaling) involves running your product on a “bigger” server, meaning a combination of more processing power (CPUs / cores), and/or more memory. How much more of each you require is determined by which limits you are hitting. If your product is CPU-bound when under load, you’ll need more processing power. If it hits memory limits, then add more memory. Usually, a mix of both is required to scale up effectively.
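If you aren’t sure which limit you’re hitting, sampling the box while it’s under realistic load is a quick first check. This sketch uses the third-party psutil library; the 85%/70% thresholds are illustrative, not rules.

```python
import psutil

# Sample while the product is under realistic load.
cpu_pct = psutil.cpu_percent(interval=5)   # average across all cores over 5s
mem_pct = psutil.virtual_memory().percent  # percentage of RAM in use

if cpu_pct > 85 and mem_pct < 70:
    print("Likely CPU-bound: scale up with more cores or faster CPUs")
elif mem_pct > 85 and cpu_pct < 70:
    print("Likely memory-bound: scale up with more RAM")
else:
    print(f"CPU {cpu_pct}%, memory {mem_pct}%: profile further before deciding")
```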
Scaling up is easy to accomplish because it requires no architectural changes to your software. You just deploy your product to a larger instance (on AWS or Heroku), or buy a bigger server if you are hosting the product yourself. The only problem is that it needs to be a pre-planned affair, unless you can seamlessly upgrade your instance as the need for additional resources grows.
Scaling up should, however, be your first move. You should be running your product on a box that will — on its own — meet projected capacity for a while without needing to scale further.
Your next move is to scale out (often called Horizontal Scaling). This involves distributing traffic across multiple servers. Hosting providers such as AWS provide load balancing technology to achieve this, spreading the load across however many servers you currently have running.
Scaling out introduces some architectural challenges but, unlike scaling up, it can be adaptive: you just add additional resources to meet demand, and remove them again when you no longer need that capacity (though there are challenges in scaling down too; see later).
One of the main challenges with scaling out is that the state of your application is no longer held in memory within one server; it is distributed between servers. If you are caching certain information for performance reasons, one server’s view of that information may be out of date if a change is applied via another server. Simple solutions involve techniques such as “sticky sessions”, whereby individual users are always load balanced to the same server instance whilst it is running. AWS offer this as an easily-configurable option. But the state of your application may not be easily subdivided by user. You may need to use your database as a place to update shared data transactionally, such that race conditions don’t cause you to lose or miscalculate data, as sketched below.
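Here is a minimal sketch of that “update shared data transactionally” idea, using PostgreSQL row locking via psycopg2. The table, columns and connection string are invented for illustration.

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # connection details are illustrative


def add_credit(account_id, amount):
    """Update a shared balance safely when several app servers may race."""
    with conn:  # commits on success, rolls back on exception
        with conn.cursor() as cur:
            # Row-level lock: a second server doing the same update waits here
            # rather than reading a stale balance.
            cur.execute(
                "SELECT balance FROM accounts WHERE id = %s FOR UPDATE",
                (account_id,),
            )
            (balance,) = cur.fetchone()
            cur.execute(
                "UPDATE accounts SET balance = %s WHERE id = %s",
                (balance + amount, account_id),
            )
```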
Another thing to consider is the use of a message bus (such as RabbitMQ or Amazon’s SQS) so that information can be shared between servers; even something as simple as telling all instances when to invalidate specific data in their cache can be easy to implement via messaging.
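As one possible shape for that cache-invalidation idea, the sketch below uses RabbitMQ’s Python client (pika) with a fanout exchange so every app server hears about a change. The host, exchange name and cache structure are all assumptions.

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq-host"))
channel = connection.channel()
channel.exchange_declare(exchange="cache-invalidation", exchange_type="fanout")


# Publisher side: any server that changes the underlying data broadcasts the key.
def invalidate_everywhere(cache_key):
    channel.basic_publish(exchange="cache-invalidation", routing_key="", body=cache_key)


# Consumer side (each app server binds its own exclusive queue and drops the
# key from its local cache when told to).
local_cache = {}


def on_invalidate(ch, method, properties, body):
    local_cache.pop(body.decode(), None)


result = channel.queue_declare(queue="", exclusive=True)
channel.queue_bind(exchange="cache-invalidation", queue=result.method.queue)
channel.basic_consume(queue=result.method.queue,
                      on_message_callback=on_invalidate,
                      auto_ack=True)
# channel.start_consuming()  # run in the server's background thread/process
```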
The architectural challenges introduced by scaling out are beyond the scope of this article, but the benefits of tackling them and being able to add spare capacity whenever you want are certainly worth it.
Incidentally, I mentioned scaling down being a challenge. You can’t simply remove spare capacity from behind a load balancer, because the instances may be in the process of handling a request from a user, resulting in an error or timeout. AWS now offer “connection draining”, whereby instances that are about to be terminated are given a configured period of time to finish handling existing requests, but no new requests are routed to them. This makes scaling down as effortless as scaling out, if correctly configured.
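For the classic Elastic Load Balancer, enabling connection draining might look like the boto3 sketch below; the load balancer name and the 120-second timeout are made up. (On newer Application Load Balancers the equivalent knob is the target group’s deregistration delay.)

```python
import boto3

elb = boto3.client("elb")  # classic Elastic Load Balancing API

# Give instances being removed up to 120 seconds to finish in-flight requests,
# while the load balancer stops routing new requests to them.
elb.modify_load_balancer_attributes(
    LoadBalancerName="web-app-elb",
    LoadBalancerAttributes={
        "ConnectionDraining": {"Enabled": True, "Timeout": 120},
    },
)
```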
So far, both methods of scaling have involved either resizing or duplicating servers running the same software and performing the same role. However, certain activities performed by your product may form a larger part of its performance bottleneck, such as expensive computations. It may make sense to scale portions of your product separately from others, such that the spare capacity and scalability of one aspect can be controlled separately from that of another.
This might be most easily achieved by separating the product into Services, and scaling the number of instances of each of those services separately to meet demand. Some services may share the same process, but others may run stand-alone, perhaps connected to the other services via a message bus, socket connection, or just by virtue of sharing the same database.
If certain services perform computationally costly activities, you could assign them to compute-optimised instances and scale the number of instances based upon the size of the queue of waiting work, the average turnaround time for tasks, or another metric you’d like to optimise.
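A queue-depth metric of that kind might be derived as in the sketch below, which reads the backlog from an SQS queue with boto3 and turns it into a desired worker count. The queue URL, the tasks-per-worker figure and the min/max bounds are all illustrative.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/render-jobs"  # illustrative
TASKS_PER_WORKER = 50  # from benchmarking how much backlog one worker clears acceptably


def desired_worker_count(min_workers=1, max_workers=20):
    """Suggest how many compute-optimised workers to run, from queue depth."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    wanted = -(-backlog // TASKS_PER_WORKER)  # ceiling division
    return max(min_workers, min(max_workers, wanted))
```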
Separating concerns in this manner becomes more cost-effective as the overall size of a product deployment increases. There is little point, initially, in doing this with a small deployment; everything should probably run in the same process, or at least one of several identical processes, at first.
DELAYING THE NEED TO SCALE
Sometimes, it is entirely possible to delay the need to scale. In web applications, if certain computations, particularly those involving just aggregation or simple processing before presentation, can be offloaded to the browser without increasing bandwidth or affecting user experience, this saves server resources and reduces the server-side cost per user. This, in turn, delays or reduces the need to scale.
Where activities are not time-sensitive, scheduling them to coincide with times of lower traffic can also reduce the server resources required at any given time, and delay (or entirely avoid) the need to scale.
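One very simple way to defer non-urgent work to a quiet window is sketched below; the 02:00–06:00 window is an assumption, and `run_job` / `enqueue_for` stand in for whatever job runner or delayed-task mechanism you already use.

```python
from datetime import datetime, time, timedelta

QUIET_START = time(2, 0)   # assumed low-traffic window starts at 02:00
QUIET_END = time(6, 0)     # ...and ends at 06:00


def schedule_report_generation(run_job, enqueue_for):
    """Run non-urgent work now only if we're in the quiet window, else defer it."""
    now = datetime.now()
    if QUIET_START <= now.time() <= QUIET_END:
        run_job()
    else:
        next_run = now.replace(hour=QUIET_START.hour, minute=0, second=0, microsecond=0)
        if now.time() > QUIET_END:
            next_run += timedelta(days=1)
        enqueue_for(next_run)  # e.g. a delayed job, Celery eta, or cron-style scheduler
```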
Software Scalability is clearly a huge topic. I hope this is a useful overview, particularly if you don’t come from a software background, and that it leads you to delve deeper into specific areas. At the very least, you might start building metrics into your product and use them to figure out when Scalability is likely to become a more pressing concern for you.