The perfect architecture for startups: Monoliths or Microservices?

The debate on monoliths vs microservices never ends. This article walks through different takes on both architectures to understand which is best suited for start-ups, and when.

At the Sandbox Conference in July 2022, Arnav Gupta, Product and Strategy at Scalar Academy, spoke about microservices vs monoliths in the context of start-ups. This article walks through the session and its learnings on how to set up a good architecture.

Microservices or Monoliths?

We start with an exercise Ahmad Malvia ran on Twitter in early 2022, called The Weekend Dev Puzzle. The puzzle is not directly about microservices, but it touches on the same trade-offs.

Let's start with scenario A

Imagine we have multiple web servers running in parallel behind a high-availability load balancer, all talking to a single database. Now, let's say we decide to change to a different setup.

We extract the code that talks to the database into a separate service, put that behind another load balancer, and keep the user-facing servers separate.

This is a typical pattern many teams use when they first start distributing a database. Another thing many teams do, usually not at a very early stage but once they have the SRE capability, is use a smart client that is aware of the topology of all the services.

The smart client is embedded inside the web server; instead of calling a load balancer, it calls one of the microservices directly, based on the current network state and whichever is the nearest available edge.

We can do this either way: go topology-aware, or call the services directly.

Ahmad Malvia posted diagrams of all three setups and asked the following question:

“Can you tell me what happens to the availability of my entire stack when I go from scenario B to scenario C? Let's assume that our load balancer has never failed to date, but it cannot have 100% availability.”

To answer this, let's say the designed SLA availability is 99.95%, and our web servers are down for, say, three hours every month, so that we have some parameters to work with. We can adjust these assumptions for our own use case.

So what happens to our availability when we make this transformation? Out of scenarios A, B, and C, we need to guess which one has the highest availability.

Scenario A, which is a load balancer with three or more servers

Scenario B, where we extract out the DB service behind a load balancer

Scenario C, i.e. the one with a smart client

We also need to understand what other factors come into play beyond the defaults.

For every layer we introduce, the net availability drops below that of the previous setup, because each layer is itself only about 99.9% available and availabilities multiply. Every extra factor in the product pushes the total lower.
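A minimal Python sketch of that multiplication, using the 99.9% per-layer figure from above (the function name is illustrative):

```python
# Each layer's availability multiplies into the net availability,
# so every extra hop drags the total down.
def net_availability(layers):
    total = 1.0
    for a in layers:
        total *= a
    return total

# One layer at 99.9% vs three stacked layers at 99.9% each.
print(round(net_availability([0.999]), 5))      # 0.999
print(round(net_availability([0.999] * 3), 5))  # 0.997
```

Three layers at 99.9% each already pull the stack down to roughly 99.7%.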

Going by that, one might assume we should never split out services and should have monoliths everywhere. But we must ask ourselves where the threshold lies for taking that decision, and which factors come into play. Ahmad Malvia gave an excellent mathematical explanation.

Let us assign probability numbers to the availabilities: P(LB) for each load balancer, and, since there is a network boundary between every pair of components, a network availability P(N) across each boundary. To simplify, we will assume every load balancer we introduce has similar operational constraints.

We will assume similar constraints at each boundary, just to simplify the math. When we have a single web service, its availability is a single value. Splitting it up gives one availability value for the extracted microservice and another for the remaining web service. A small caveat: not all calls go through the entire stack. Some calls will be answered by the top-layer service alone, avoiding the availability problem entirely; our bottom line, though, assumes the worst case where calls traverse the whole stack.

In scenario B, P(N) gets multiplied many times, because we cross the network boundary four times. Scenario C has a fascinating wrinkle: with a topology-aware client that knows which service to hit, the load balancer goes away, but the client itself has an availability factor. P(SVC) is upper-bounded by P(N), since whatever the availability of the network, the client still needs the network to discover which microservice to hit.

Some assumptions

We make some assumptions here, like taking each load balancer at 99.995% and the web service at 99.6%.


Choosing these values is where the rubber meets the road: deciding on the threshold and whether such a move actually helps.

If we refactor it well, splitting the service will increase the availability of each component; refactoring it badly will not.

Consider two cases: one where each component's availability goes up by 0.1% after the split, and one where it goes down by 0.1%. Multiplying the final component values gives the net availability.

For both scenarios B and C, if the component availability increases slightly with the refactoring, the net availability holds up; if it decreases slightly, the net availability drops below 99%.

If we crunch the numbers, we will see that splitting the service that had 99.6% availability starts making sense once both resulting services reach at least 99.85% availability.

After doing the split, our availability will not fall.
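A quick sanity check of that threshold, using the figures above (99.6% monolith, two 99.85% halves, a 99.995% load balancer between them; the exact composition of factors here is an assumption, not from the talk):

```python
# Net availability of the split stack: service A * new LB * service B.
monolith = 0.996
split = 0.9985 * 0.99995 * 0.9985
print(f"{split:.5f}")        # 0.99695, still above the 99.6% monolith
print(split >= monolith)     # True
```

Even after paying for the extra load balancer, the split stack stays above the old monolith's 99.6%, which is exactly the threshold effect described above.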

So there is mathematical backing for the decision: the layers we introduce lower availability when we split, but if we refactor the code well enough, there can be a net gain. Smart clients can beat a normal load balancer if their algorithm is more reliable.

The saga of the user


The next story is about user services, a pattern that is quite common among startups when they start out.

One of the first services that generally gets extracted is the user authentication service, because of common cases like a video OTT platform, a community forum, and a chat-support platform all using the same user account, which requires SSO.

Here there are a bunch of different servers; all we have to do is ensure the usual OAuth flow, which is easy to follow for anyone who has worked on web services before.

When we try to log in, we are redirected to the auth server, which issues a token in a private token exchange between the client server and the authentication server.

This generates a final auth token; the forum then exchanges that token, backed by an OAuth token internally, if we do the auth explicitly and not just on the frontend.

Once this is done, we are left with basic token-based authentication: for every request we make, there is a call between the two services to validate the authorisation. The full authorisation happens only the first time, but that validation request happens on every call.

There are many ways to work around this extra latency; a fairly common one is to redesign the auth differently. This is where JWT-based authentication comes in handy.

Auth Service vs Auth Library

Usually, the auth product is two things: both a service and a library. The library can do stateless validation of a token without hitting the database or the auth service, but creating sessions and issuing tokens still requires the auth service. Setting things up this way comes with both pros and cons.
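A minimal standard-library sketch of the stateless-validation idea: an HMAC-signed token that any client service can check without calling the auth service. This is a toy stand-in for real JWTs; the key and claim names are illustrative:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"shared-signing-key"  # hypothetical key shared with the auth service

def issue_token(user_id):
    """Only the auth service does this: sign the claims."""
    payload = base64.urlsafe_b64encode(json.dumps({"sub": user_id}).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest().encode()
    return payload + b"." + sig

def validate_token(token):
    """Any client service can do this statelessly -- no auth-service call."""
    payload, sig = token.rsplit(b".", 1)
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(sig, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))["sub"]

token = issue_token("user-42")
print(validate_token(token))         # user-42
print(validate_token(token + b"x"))  # None (tampered signature)
```

Issuing still needs the shared secret (i.e. the auth service), but validation is a pure in-process computation, which is exactly what removes the per-request round trip.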

In terms of the auth service, one pro is that it stops being a single point of failure, which it was in the previous setup. If it is down, new users won't be able to log in or register, and those flows will fail. But users who hold valid tokens can continue using the client services, so the auth service can tolerate a bit of downtime without most users noticing.

This can produce odd-looking login outages; for instance, users complaining on Twitter: “I'm unable to log in, but I'm already logged in on my app”.

This is the classic trade-off between cryptography-based and database-based tokens, and server-side invalidation is something we might lose by doing this.

Cutting across Seams

One of the cases where we need this kind of mathematical validation of availability is deciding what trade-offs to make when creating services.

Consider a theoretical startup: as we build, a time comes when we have to decide to extract services, and there are various ways to approach the problem. Let us look at the pros and cons of each.

For example, we have a blogging service. Taking a general approach, we have the standard layers - the presentation layer, a data layer, and a domain layer. 

Say the blogging service API looks like this: a bunch of controllers handling authentication, users, articles, and comments, with repositories for each. Then we would have certain services, some of which do things not tied to a front-end endpoint, like moderating comments; it could even be a worker that removes comments containing, for example, banned words. We might also have a feed service, which builds a cache of the article feed based on some machine-learning operations. If we have an API gateway, we usually split things into services by looking at them layer by layer, and depending on the granularity of the split, we might club certain business logic together into some services.

Let us assume a single data service running on the database. Unless we need to move to distributed databases, we can keep a single data service.

This is where Conway's Law comes in: technology companies ship their organisation structure. At larger scale, this tends to show up as a pod structure. At a startup, it depends on when you start ending up in pods, which usually correlates with the velocity of product-manager hiring. Once you have product managers, they create their own streams of feature work, and pods get created automatically, which leads to a feature-wise approach to splitting.

On splitting, we end up with something a little different: a very thin proxy at the front, behind which we cluster the entire stack of services, each with its own controller and service layers. Another pattern becoming common is making libraries instead of services. By making data access a library, we can simply embed it, so the team that would otherwise manage a running data service only needs to manage the data library, which every client team can use.

This gives rise to other challenges, like concurrency: we need to ensure the library behaves correctly in the context of each different app. But it also lets the team that made the library reuse familiar methodologies from other contexts, like dynamically linked libraries for desktop apps.

However, the code itself barely changes as a result. For example, a Spring Boot application still has its common pieces, a bunch of controllers, a bunch of services, and so on; none of those smaller boxes change. It is all about how we draw the split.

Going this way, we avoid a lot of serialisation and deserialisation overhead, and the data stays in the same process for the entire API call.

With service boundaries, by contrast, there is a serialisation/deserialisation step at every hop in the setup, and each such boundary can add quite a lot of overhead depending on the data we are working with. The other pieces, like showing public articles and their comments, are ones where we are banking heavily on SEO: a lot of hits will come in, and maybe 90% of our users are not logged in. With a very independent service for this, we can invest heavily in availability and better cache-invalidation strategies. Even if authentication is down, web crawlers can still reach our content and keep the content network up. So for anybody deep in the content game, garnering a lot of traffic via SEO and organic channels, this setup is valuable, because the team working on articles can optimise the article cache layer separately for their use case. Again, this comes with a different set of trade-offs.
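A tiny sketch of the serialisation point: the same data access costs a full serialise/deserialise round trip once a service boundary sits in the middle (the field names are made up for illustration):

```python
import json

article = {"id": 17, "title": "Hello", "comments": [{"by": "a", "text": "hi"}]}

# In a monolith, the comments code sees the same in-memory object.
def render_in_process(a):
    return len(a["comments"])

# Across a service boundary, the same data is serialised, sent over the
# wire, and deserialised again -- pure overhead when both sides could
# have lived in one process.
def render_across_boundary(a):
    wire = json.dumps(a)         # serialise at the caller
    received = json.loads(wire)  # deserialise at the callee
    return len(received["comments"])

print(render_in_process(article) == render_across_boundary(article))  # True
```

The result is identical; the boundary only adds CPU and latency cost, which grows with the size and shape of the payload.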

Across these three examples, the actual overheads of going with microservices are significant. Keeping them in mind is very important when starting with microservices: each boundary introduces an availability overhead, and the serialisation and deserialisation overhead, less talked about but important, can affect us differently depending on the language we use, because different tech stacks are hit by isolation in different ways.

Every time we introduce rigid boundaries, our documentation demand goes up whether we document or not: even in a zero-documentation shop, the ideal amount of documentation grows with every boundary we introduce, and so do the observability challenges. In a monolith, a pocket of unobserved code still benefits from the general observability of the service it lives in; once split into microservices, each service needs its own observability. This is another overhead to account for. Every time we introduce a microservice, there is a long checklist to go through, and all of these costs are on it; skip paying them and we fall into trouble.

At larger scale, introducing a new microservice also increases our total on-call man-hours, and if we don't design it properly, we introduce a brand-new single point of failure.

What should startups do?

If you have already started with microservices, that might be an honest mistake. The suggestion is to begin with a monolith and then extract the easy pickings: services that don't need much thought, “dumb labour” kinds of work such as image upload and video transcoding.

These kinds of pipelines don't need much system or user context (a vendor ID, product ID, etc.) to operate; they are pure labour, data-in data-out, pure-function kinds of services. Side effects are another good candidate: logging, auditing, and pushing things to cold storage. These are not part of our core workflow anyway, so we can start pushing them out.

Post-processing is another: after images have been uploaded, we may want to run some computer vision to delete an image containing obscenity. These workflows use a pretty standard pattern, like putting a queue in front. Yet another reason to extract things out is spiky workloads; for instance, in an online test, many OTP (one-time password) requests pop up the moment the test starts. Wherever traffic is spiky, extracting the work out means we can scale it up and down separately.
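A minimal sketch of the “put a queue in front of it” pattern, with a hypothetical moderation worker (the obscenity check is a stand-in for a real computer-vision model):

```python
import queue
import threading

jobs = queue.Queue()
flagged = []

def moderation_worker():
    """Drains the queue off the request path; runs until it sees a sentinel."""
    while True:
        image_id = jobs.get()
        if image_id is None:  # sentinel: shut down
            break
        if "bad" in image_id:  # stand-in for the real obscenity check
            flagged.append(image_id)
        jobs.task_done()

t = threading.Thread(target=moderation_worker)
t.start()
for image in ["img-1", "img-bad-2", "img-3"]:
    jobs.put(image)  # upload path: enqueue and return immediately
jobs.put(None)
t.join()
print(flagged)  # ['img-bad-2']
```

The upload path only enqueues and returns, so spiky moderation load never blocks user-facing requests; in production the in-process queue would typically be a broker like SQS or RabbitMQ.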

Another thing to consider, for those thinking about microservices, is whether they can extract something sacrificial: something that can be killed without stopping the core flow. For example, during IPL, when loads are heavy, some platforms turn off the recommendation engine so everyone gets the same feed; since 90% of users want to watch IPL, they do not care about recommendations. IPL goes at the top, and everything else is one common, non-personalised feed.

This matters because Machine Learning (ML) workflows require a lot of computing power, and turning the recommendation system off saves some of it.

Here are a few guidelines for the tech team to keep in mind.

  1. Try to keep the engineer-to-services ratio at N:1, not 1:N; 10 people and 20 services will make things a mess very soon.
  2. Ideally, keep few enough services that each has clear one-to-one ownership. Call some of them exotic services: services that can bring a lot of things down, like the API gateway and the service registry. For these, apply the same N:1 logic, but count the number of senior engineers on the team.
  3. “Senior” here does not mean three years of experience; it means engineers who can fix things within half an hour of a phone call. Apply this principle especially to the exotic services on the boundaries, which can be single points of failure, and have at least one such senior engineer per service. These guardrails act as rules that keep us out of microservice hell.
  4. Creating services shouldn't be too cheap, in the sense that nobody should be able to spin up a microservice quickly without following the guidelines. Otherwise we run into the observability problem later, when services go down: picture a junior engineer who created an undocumented service and then left the company, and the service falls over five years later.

Further reading

Here is some further reading that you might refer to later.

An excellent article called MonolithFirst by Martin Fowler, and another on martinfowler.com called How to Break a Monolith into Microservices. Both are excellent reads for anybody considering scaling up their startup's tech.

Another excellent read is Segment's account of how they went to microservices and came back. There is a lot of literature on why to use microservices, but very few real examples of the problems people run into when using them at vast scale. Segment wrote a lovely series of blog articles about the problems they hit and why they scaled the microservices back.

Finally, Lyft has a four-part blog series for people who use microservices, on how they work on their developer experience. It is cohesive, and a good answer to how to build pleasant local development environments for microservices, with the boundaries properly tested so people can work end to end. In the usual case, things work locally but not in production, and Lyft has a lovely write-up of how they solve this.

Diksha B Patro
September 21, 2022