Code Recipe

Dec 5, 2021 · 10 min

Netflix System Design - A Complete Guide

Updated: Jun 3, 2022

Problem Statement

Design a video streaming service like Netflix, Amazon Prime Video or YouTube. For this Netflix system design, assume our platform needs to support a total of 1 billion users, with 500 million of them being daily active users.

Functional Requirements

  1. Our platform should allow users to upload videos.

  2. Users should be able to watch videos on our platform.

  3. Users should be able to search for videos.

  4. Users should be able to see likes/dislikes and the view count for videos.

Non-Functional Requirements

  1. High availability.

  2. Our system should be reliable (we should make sure uploaded videos are not lost).

  3. Low latency (Users should not experience lag while watching videos).

  4. Consistency is important, but not at the cost of availability. It's okay if one user sees a video a little later than another user.

Netflix System Design

Netflix system design is a common question asked in interviews these days. Before we begin with the design, it is important to understand the general behavior of the system we are trying to build.

Systems like Netflix and YouTube are generally read heavy, i.e. far fewer people upload videos compared to the number of people viewing them. So there will be a much higher number of reads than writes on the database and storage where we keep our videos and video metadata. We need to keep this in mind while designing our system.

We will be using a microservice-based architecture rather than a monolith, mainly for the independent scaling and deployment benefits it offers.

Now coming to the choice of storage, we have different kinds of data to store like user data, video metadata and the video file itself.

Where does our onboarding service save the uploaded video? Can we store it in a database? It is generally not a good practice to store video files in a database, because this puts a lot of pressure on the database and consumes a lot of bandwidth, since fetching a video is a slow and heavy operation.

Distributed object storage like Amazon S3 is designed to store flat files like images and videos. So we store the actual video file on S3 and keep only the link to the S3 object in our database.
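As a rough illustration, here is a minimal sketch of this idea in Python using boto3. The bucket name, key layout and metadata fields are assumptions made for illustration, not part of the original design:

```python
import uuid
import boto3

s3 = boto3.client("s3")

RAW_VIDEO_BUCKET = "raw-videos"  # hypothetical bucket name


def upload_raw_video(local_path: str, title: str) -> dict:
    """Upload the raw video file to S3 and return the metadata row we would persist."""
    video_id = str(uuid.uuid4())
    s3_key = f"raw/{video_id}.mp4"

    # Store the large binary in object storage, not in the database.
    s3.upload_file(local_path, RAW_VIDEO_BUCKET, s3_key)

    # Only the S3 link (plus other metadata) goes into the metadata database.
    return {
        "video_id": video_id,
        "title": title,
        "raw_video_url": f"s3://{RAW_VIDEO_BUCKET}/{s3_key}",
    }
```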

Coming to user data and video metadata: since the size of the user data may not be that huge, we could store it in a SQL database. But we are talking about 1 billion users here, so the amount of data is still considerable and will only grow over time. A NoSQL database gives us better flexibility for scaling horizontally or sharding if the need arises at a later point. Also, looking at the use case, we don't see a requirement for complex relational queries or joins. So NoSQL is the better choice.

Which No-SQL solution would be a better choice for our use case?

Since we know the read-to-write ratio is high for our use case, a wide-column database like Cassandra, which scales well under heavy read and write traffic, would suit our needs. Also, since we are looking for high availability and consistency can take a back seat in its favor, Cassandra makes an ideal choice. Having said that, it is important to note that Cassandra only promises eventual consistency.
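For illustration, a possible Cassandra table for the video metadata could look like this using the Python driver. The keyspace, table and column names here are assumptions, not a prescribed schema:

```python
from cassandra.cluster import Cluster

# Hypothetical cluster address and keyspace.
session = Cluster(["cassandra-host"]).connect("netflix")

# Video metadata keyed by video_id; counters such as views and likes could
# also live in a separate counter table.
session.execute("""
    CREATE TABLE IF NOT EXISTS video_metadata (
        video_id      uuid PRIMARY KEY,
        title         text,
        description   text,
        raw_video_url text,
        uploaded_at   timestamp
    )
""")

session.execute(
    "INSERT INTO video_metadata (video_id, title, raw_video_url, uploaded_at) "
    "VALUES (uuid(), %s, %s, toTimestamp(now()))",
    ("Sample movie", "s3://raw-videos/raw/1234.mp4"),
)
```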

High Level View

At a high level, our service needs to support the following main functionalities:

1. Upload videos

2. Search for videos

3. Stream videos


 
To start with, let us have three microservices to handle these responsibilities. We will call them the onboarding service, the video search service and the streaming service. Let's look at each of these functionalities in detail.

Onboarding Videos(Uploading Videos)

Uploading a video is done by content creators. For a service like YouTube these could be individual creators, while for Netflix it would be production houses. The length of the uploaded videos might also vary from platform to platform. We will not get into the details of user types here, and we will assume the uploaded videos are fairly large (movies, web series etc.). In either case this should not majorly impact the way we design our system.

When a user uploads a video, the request hits the onboarding service. The onboarding service is responsible for uploading this video to S3. Uploading to object storage first is important because:

  1. If the video is large, holding it in memory in the video processing service for the duration of processing would consume a lot of resources (memory).

  2. It serves as a backup in case our video processing service goes down due to a failure while processing the video.

Object storage services like Amazon S3 are distributed and replicated, so the chances of an uploaded video getting lost are extremely low.

After this the onboarding service saves the video metadata to our metadata database.

Video Processing

Once the video is uploaded to S3, we need to process it further before making it available to our viewers. Processing a video can be a time-consuming task; we may have to pass the video through multiple intermediate stages before it is ready for viewers to consume.

It is a good idea to distribute the various responsibilities involved in processing a video across different microservices, because some stages of the processing might be computationally intensive and take more time, while other stages finish faster. Having a microservice handle each stage gives us more control if we have to selectively scale specific stages to improve performance. For instance, for stages that are slow and require more time, we can run more replicas of that microservice; faster stages may not need as many replicas. Having a microservice per stage also enables parallel processing: a faster stage does not have to wait for a slower stage to finish. After a stage finishes its job, it simply puts an event in a queue and takes up the next task, and the next stage polls the event from the queue when it is ready. Processing stages in parallel this way reduces the overall time needed to process a video.

Video Splitter Service

In a system like Netflix, the size of the uploaded videos can be very large, and we would not want the microservices processing the video to hold the whole video in memory for the duration of processing.

Processing the whole video as a single unit increases processing time, and if processing fails midway for some reason we would have to start over with the entire video. Keeping a large video in memory during processing is also expensive. Therefore we split the video into chunks and process these chunks, rather than the whole video.

Splitting the video this way helps distribute the work and process chunks in parallel. In case of a failure we don't have to start from the beginning; we can resume from the last successfully processed chunk.

Client apps these days can request the video in various qualities and formats while the user is watching. Dividing the video into chunks improves the viewing experience because our service doesn't have to respond with the whole video again; it only needs to respond with the chunk in the quality and format that the user has requested.

Let's call the service responsible for dividing the video into chunks the video splitter service.

The video splitter service is responsible for dividing the given video into multiple smaller chunks. Once the video is split, the video splitter service uploads these chunks to S3 and updates the chunk metadata in the metadata database.
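A simplified sketch of what the splitter might do, assuming ffmpeg is available on the worker and using fixed-length segments. The bucket name, key layout and chunk duration are illustrative assumptions:

```python
import os
import subprocess
import boto3

s3 = boto3.client("s3")
CHUNK_BUCKET = "video-chunks"  # hypothetical bucket


def split_and_upload(video_id: str, local_path: str, out_dir: str,
                     chunk_seconds: int = 10) -> list[str]:
    """Split the video into fixed-length chunks with ffmpeg and upload each chunk to S3."""
    pattern = os.path.join(out_dir, "chunk_%05d.mp4")
    subprocess.run(
        ["ffmpeg", "-i", local_path, "-c", "copy", "-f", "segment",
         "-segment_time", str(chunk_seconds), "-reset_timestamps", "1", pattern],
        check=True,
    )

    chunk_keys = []
    for name in sorted(os.listdir(out_dir)):
        key = f"{video_id}/{name}"
        s3.upload_file(os.path.join(out_dir, name), CHUNK_BUCKET, key)
        chunk_keys.append(key)

    # These chunk keys are what we record in the metadata database.
    return chunk_keys
```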

But you may ask: how does the video splitter know when the video has been uploaded to S3? And how does the next stage (microservice) in video processing know when the video splitter service has successfully completed its job?

That is a valid question. One way to achieve this is for the onboarding service to explicitly notify the video splitter service via a REST or gRPC call when it finishes its job; the video splitter service can do the same for the next stage, and so on. This approach works and serves the purpose. But the problem is that each microservice needs to be aware of the next microservice in the video processing pipeline.

This approach also tightly couples the video processing stages to the microservices involved, which is something we might not want, especially when we have to scale our services later.

Another way to achieve this is a pub-sub mechanism using a distributed message queue like Apache Kafka. Each microservice that needs coordination registers with Kafka. After a microservice has done its part of the processing, it publishes a completed event to a Kafka topic. The microservices in the next stage of processing listen to these events and take it forward from there.

In this approach, each microservice does not need to know about the other microservices in the processing pipeline, nor does it need to communicate with them directly. Microservices only communicate with Kafka, and hence this also scales well. If tomorrow we had to add another stage to our video processing sequence, all we have to do is add a microservice to handle the processing for that stage and register it with Kafka; the next service can listen to its event and take it forward from there. It does not need any knowledge of the other microservices or what they do.
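Here is a minimal sketch of this coordination using kafka-python. The topic names, event fields and broker address are assumptions made for illustration:

```python
import json
from kafka import KafkaProducer, KafkaConsumer

BROKERS = ["kafka:9092"]  # hypothetical broker address

# Producer side: the video splitter publishes a "completed" event when it is done.
producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("video.split.completed", {"video_id": "1234", "chunks": 42})
producer.flush()

# Consumer side: the encoder service listens for split-completed events.
consumer = KafkaConsumer(
    "video.split.completed",
    bootstrap_servers=BROKERS,
    group_id="video-encoder-service",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for event in consumer:
    video_id = event.value["video_id"]
    # ... encode the chunks for video_id, then publish video.encode.completed ...
    print(f"encoding chunks for {video_id}")
```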

Video Encoder Service


 
The next step in our video processing sequence is to encode these chunks. We can have another microservice, let's say the video encoder service, that handles this part of the processing. The video encoder service is responsible for encoding the video into different formats, qualities and resolutions. This is necessary because:

  1. Our videos will be viewed on multiple types of devices.

  2. Users have different network bandwidths, so they might request the video in various qualities depending on their network speed.

Once the encoder service successfully encodes each of these chunks into the various combinations of video format, quality and resolution, it publishes a completed event to the Kafka queue and updates the encoded video metadata in the metadata database.
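As a rough sketch, encoding a chunk into a few resolutions with ffmpeg might look like the following. The bitrate ladder and flags are illustrative assumptions; a real encoding pipeline is far more involved:

```python
import subprocess

# Illustrative bitrate ladder; a real service would have a much richer set
# of format/quality/resolution combinations.
RENDITIONS = [
    {"name": "480p", "height": 480, "bitrate": "1000k"},
    {"name": "720p", "height": 720, "bitrate": "3000k"},
    {"name": "1080p", "height": 1080, "bitrate": "6000k"},
]


def encode_chunk(chunk_path: str) -> list[str]:
    """Encode one chunk into every rendition and return the output file paths."""
    outputs = []
    for r in RENDITIONS:
        out_path = chunk_path.replace(".mp4", f"_{r['name']}.mp4")
        subprocess.run(
            ["ffmpeg", "-y", "-i", chunk_path,
             "-vf", f"scale=-2:{r['height']}",
             "-c:v", "libx264", "-b:v", r["bitrate"],
             "-c:a", "aac", out_path],
            check=True,
        )
        outputs.append(out_path)
    return outputs
```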

Content Verification Service

The next stage in our processing is verification of the contents of the video. The content verification service is responsible for checking the video for restricted content like violence, nudity etc. If this check fails, a notification is sent to the uploader of the video along with the reason for rejection. If the video is successfully verified, the content verification service publishes a completed event to the Kafka queue.

Video Tagging service

The video tagging service is responsible for creating tags for uploaded videos. These tags could be created based on the description provided by the uploader during video upload, among other things. All the created tags are then pushed to Elasticsearch. These tags can be used at a later point for video search, video recommendations etc. Once it finishes its task, the video tagging service publishes an event to the queue and uploads the fully processed video to S3.
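A minimal sketch of this step using the Python Elasticsearch client (8.x-style API); the index name and document fields are assumptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")  # hypothetical address


def index_video_tags(video_id: str, title: str, description: str,
                     tags: list[str]) -> None:
    """Index the video's tags and description so the search service can query them later."""
    es.index(
        index="videos",  # hypothetical index name
        id=video_id,
        document={
            "title": title,
            "description": description,
            "tags": tags,
        },
    )
```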

Note: We have covered some of the general steps for processing a video. The actual stages may vary depending on the requirements. The idea is to give a general perspective on how this can be handled irrespective of the stages involved.

Video Distribution Service

So now we have a fully processed video in S3, ready to be viewed by our users. A platform like Netflix or YouTube has a global audience. If our servers and infrastructure are located in the United States, someone requesting a video from India, for instance, would face a lot of delay because of the network round-trip time. Since it is not feasible for us to place servers in every location to serve our worldwide audience without delays, this is where the video distribution service comes into play.

The video distribution service is responsible for picking up the final processed video and pushing it to our caching servers (CDN). There are several strategies for caching our data on these CDN servers located worldwide. We can employ a push model, where we proactively push the data to the CDN servers. There is also a pull model, in which data is pulled into the cache servers when a user requests it for the first time. Both approaches have their own pros and cons; the choice between the two depends on what suits our requirements.

Suppose we decide to go with the push model. It is important to consider that it may not be a good idea to push our content to all CDN servers. For instance, a Bollywood movie released in India being cached on an African CDN might not be that useful. The video distribution service should push these videos to specific CDNs based on some intelligent algorithm. One way is to ask the uploader to provide the target audience for the video during upload, and push only to the CDN servers in the provided target regions. For a region where the content is not cached in a CDN, the first user might experience some delay when requesting the video since it has to be streamed from our origin servers, but later users won't see the delay since their requests would be served from the nearest CDN server once the content is cached.
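How the push actually happens is CDN-vendor specific. As a rough illustration only, assuming regional edge buckets act as CDN origins (the bucket names and region codes below are hypothetical), the distribution service might copy the processed chunks just to the uploader's target regions:

```python
import boto3

s3 = boto3.client("s3")

PROCESSED_BUCKET = "processed-videos"   # hypothetical origin bucket
REGIONAL_EDGE_BUCKETS = {               # hypothetical per-region edge buckets
    "IN": "edge-videos-ap-south-1",
    "US": "edge-videos-us-east-1",
    "EU": "edge-videos-eu-west-1",
}


def distribute(chunk_keys: list[str], target_regions: list[str]) -> None:
    """Copy processed chunks only to the edge buckets of the uploader's target regions."""
    for region in target_regions:
        edge_bucket = REGIONAL_EDGE_BUCKETS[region]
        for key in chunk_keys:
            s3.copy_object(
                Bucket=edge_bucket,
                Key=key,
                CopySource={"Bucket": PROCESSED_BUCKET, "Key": key},
            )
```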


Video Search

Video search is made simple because of the processing and tag creation done by the video tagging service. Whenever a user searches for a video, the query is passed to the video search service, which talks to Elasticsearch. Remember, we have already provided Elasticsearch all the tags it needs during the video tagging process. Elasticsearch provides an out-of-the-box solution for full-text search over the given text (JSON input/output).
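A sketch of what the search call might look like, reusing the hypothetical index and fields from the tagging step:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://elasticsearch:9200")  # hypothetical address


def search_videos(query_text: str) -> list[dict]:
    """Full-text search over title, description and tags in the videos index."""
    response = es.search(
        index="videos",
        query={
            "multi_match": {
                "query": query_text,
                "fields": ["title", "description", "tags"],
            }
        },
        size=20,
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]
```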

Streaming video and managing likes, dislikes, views count

When a user requests to view a video, the request is first sent to the nearest CDN server. If the video is cached there, it is served from the CDN, along with a copy of the video metadata containing the total likes, dislikes and views for that video. If the video is not cached in the CDN, both the video and the metadata have to be obtained from the origin. The request hits the video streaming service, which queries the metadata database for the URL of the requested video chunk. The video streaming service then pulls this chunk from S3 using the URL obtained from the metadata database.

Also, since this data was not found in the CDN, it can now be cached on the CDN server so that the next user requesting this video doesn't have to reach the origin server to get the content.
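Here is a rough sketch of the origin path under the assumptions used earlier (a hypothetical chunk_metadata table and chunk bucket): the streaming service resolves the chunk's S3 key from the metadata database and returns a time-limited URL that the CDN or client can fetch.

```python
import boto3
from cassandra.cluster import Cluster

s3 = boto3.client("s3")
session = Cluster(["cassandra-host"]).connect("netflix")  # hypothetical cluster/keyspace

CHUNK_BUCKET = "video-chunks"  # hypothetical bucket holding processed chunks


def get_chunk_url(video_id: str, chunk_no: int, rendition: str) -> str:
    """Resolve the chunk's S3 key from the metadata DB and return a pre-signed URL."""
    row = session.execute(
        "SELECT s3_key FROM chunk_metadata "
        "WHERE video_id = %s AND chunk_no = %s AND rendition = %s",
        (video_id, chunk_no, rendition),
    ).one()

    # The CDN (or client, on a cache miss) fetches the chunk via this URL.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": CHUNK_BUCKET, "Key": row.s3_key},
        ExpiresIn=3600,
    )
```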

Below is the final design diagram:

Netflix System Design Diagram

Summary

So did we cover all the functional and non-functional requirements? Our design supports uploading, streaming and searching videos, as well as viewing the like/dislike and view counts. So the functional requirements are covered.

Our system is reliable: Amazon S3 promises 99.999999999% durability of objects over a given year, so the chances of a video getting lost are very low. Pushing our content to CDN servers distributed across geographies ensures our users face minimal latency while streaming videos. For availability, we run multiple replicas of each service so that if one of them goes down it doesn't result in downtime for users. We can take availability to another level by having multiple data centers distributed across regions, so that even if a data center in one region goes down, users can still be served from another available data center. They might experience slight delays due to network round-trip time, but our service will still be available.

That covers all the functional and non-functional requirements mentioned. That is all for this article, thank you for taking the time to read it. If you have any questions or doubts, please let us know in the comments section below; we will be happy to answer. Make sure you check out our other posts.

You can explore more such amazing articles from code recipe in our blogs section.

Follow us on social media: Facebook, Twitter, Linkedin, Tumblr, Instagram.
