Contents
System Design: A step by step guide
Designing a URL Shortening service like TinyURL
Designing Pastebin
Designing Instagram
Designing Dropbox
Designing Facebook Messenger
Designing Twitter
Designing Youtube or Netflix
Designing Typeahead Suggestion
Designing an API Rate Limiter
Designing Twitter Search
Designing a Web Crawler
Designing Facebook’s Newsfeed
Designing Yelp or Nearby Friends
Designing Uber backend
Design Ticketmaster (*New*)
System Design Basics
Key Characteristics of Distributed Systems
Load Balancing
Caching
Data Partitioning
Indexes
Proxies
Redundancy and Replication
SQL vs. NoSQL
CAP Theorem
Consistent Hashing
Long-Polling vs WebSockets vs Server-Sent Events
System Design: A step by step guide
A lot of software engineers struggle with system design interviews (SDIs), primarily for three reasons:
• The unstructured nature of SDIs, where they are asked to work on an open-ended design problem that doesn’t have a standard answer.
• Their lack of experience in developing large-scale systems.
• They did not prepare for SDIs.
Like coding interviews, candidates who haven’t put in a conscious effort to prepare for SDIs mostly perform poorly, especially at top companies like Google, Facebook, Amazon, and Microsoft. In these companies, candidates who don’t perform above average have a limited chance of getting an offer. On the other hand, a good performance always results in a better offer (higher position and salary), since it shows the candidate’s ability to handle a complex system.
In this course, we’ll follow a step-by-step approach to solve multiple design problems. First, let’s go through these steps:
Step 1: Requirements clarifications
It is always a good idea to ask questions about the exact scope of the problem we are solving. Design questions are mostly open-ended and don’t have ONE correct answer, which is why clarifying ambiguities early in the interview becomes critical. Candidates who spend enough time defining the end goals of the system always have a better chance of being successful in the interview. Also, since we only have 35-40 minutes to design a (supposedly) large system, we should clarify what parts of the system we will be focusing on.
Let’s expand this with an actual example of designing a Twitter-like service. Here are some questions for designing Twitter that should be answered before moving on to the next steps:
• Will users of our service be able to post tweets and follow other people?
• Should we also design to create and display the user’s timeline?
• Will tweets contain photos and videos?
• Are we focusing on the backend only, or are we developing the front-end too?
• Will users be able to search tweets?
• Do we need to display hot trending topics?
• Will there be any push notification for new (or important) tweets?
All such questions will determine how our end design will look.
Step 2: System interface definition
Define what APIs are expected from the system. This will not only establish the exact contract expected from the system, but will also ensure we haven’t gotten any requirements wrong. Some examples for our Twitter-like service will be:
postTweet(user_id, tweet_data, tweet_location, user_location, timestamp, …)
generateTimeline(user_id, current_time, user_location, …)
markTweetFavorite(user_id, tweet_id, timestamp, …)
Step 3: Back-of-the-envelope estimation
It is always a good idea to estimate the scale of the system we’re going to design. This will also help later when we focus on scaling, partitioning, load balancing, and caching.
• What scale is expected from the system (e.g., number of new tweets, number of tweet views, number of timeline generations per sec., etc.)?
• How much storage will we need? We will have different numbers if users can have photos and videos in their tweets.
• What network bandwidth usage are we expecting? This will be crucial in deciding how we will manage traffic and balance load between servers.
Step 4: Defining data model
Defining the data model early will clarify how data will flow among different components of the system. Later, it will guide us towards data partitioning and management. The candidate should be able to identify various entities of the system, how they will interact with each other, and different aspects of data management like storage, transportation, encryption, etc. Here are some entities for our Twitter-like service:
User: UserID, Name, Email, DoB, CreationDate, LastLogin, etc.
Tweet: TweetID, Content, TweetLocation, NumberOfLikes, TimeStamp, etc.
UserFollow: UserID1, UserID2
FavoriteTweets: UserID, TweetID, TimeStamp
Which database system should we use? Will a NoSQL store like Cassandra best fit our needs, or should we use a MySQL-like solution? What kind of block storage should we use to store photos and videos?
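To make these entities concrete, here is a minimal sketch of them as Python dataclasses; the field names follow the text, while the types are illustrative assumptions rather than a final schema.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class User:
    user_id: int
    name: str
    email: str
    dob: datetime
    creation_date: datetime
    last_login: datetime

@dataclass
class Tweet:
    tweet_id: int
    content: str
    tweet_location: str
    number_of_likes: int
    timestamp: datetime

@dataclass
class UserFollow:
    follower_id: int    # UserID1
    followee_id: int    # UserID2

@dataclass
class FavoriteTweet:
    user_id: int
    tweet_id: int
    timestamp: datetime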
Step 5: High-level design
Draw a block diagram with 5-6 boxes representing the core components of our system. We should identify enough components to solve the actual problem from end to end.
For Twitter, at a high level, we will need multiple application servers to serve all the read/write requests, with load balancers in front of them for traffic distribution. If we’re assuming that we will have a lot more read traffic (as compared to write), we can decide to have separate servers for handling these scenarios. On the backend, we need an efficient database that can store all the tweets and can support a huge number of reads. We will also need a distributed file storage system for storing photos and videos.
Step 6: Detailed design
Dig deeper into two or three components; the interviewer’s feedback should always guide us on which parts of the system need further discussion. We should be able to present different approaches, their pros and cons, and explain why we prefer one approach over another. Remember, there is no single answer; the only important thing is to consider tradeoffs between different options while keeping system constraints in mind.
• Since we will be storing a massive amount of data, how should we partition our data to distribute it across multiple databases? Should we try to store all the data of a user on the same database? What issues could that cause?
• How will we handle hot users who tweet a lot or follow lots of people?
• Since users’ timelines will contain the most recent (and relevant) tweets, should we try to store our data in a way that is optimized for scanning the latest tweets?
• How much cache should we introduce, and at which layer, to speed things up?
• What components need better load balancing?
Step 7: Identifying and resolving bottlenecks
Try to discuss as many bottlenecks as possible and different approaches to mitigate them.
• Is there any single point of failure in our system? What are we doing to mitigate it?
• Do we have enough replicas of the data so that if we lose a few servers we can still serve our users?
• Similarly, do we have enough copies of different services running such that a few failures will not cause total system shutdown?
• How are we monitoring the performance of our service? Do we get alerts whenever critical components fail or their performance degrades?
Summary
In short, preparation and being organized during the interview are the keys to success in system design interviews. The above-mentioned steps should guide you to remain on track and cover all the different aspects while designing a system.
Let’s apply the above guidelines to design a few systems that are commonly asked about in SDIs.
Designing a URL Shortening service like TinyURL
Let's design a URL shortening service like TinyURL. This service will provide short aliases redirecting to long URLs. Similar services: bit.ly, goo.gl, qlink.me, etc. Difficulty Level: Easy
1. Why do we need URL shortening?
URL shortening is used to create shorter aliases for long URLs. We call these shortened aliases “short links.” Users are redirected to the original URL when they hit these short links. Short links save a lot of space when displayed, printed, messaged, or tweeted. Additionally, users are less likely to mistype shorter URLs.
For example, if we shorten this page through TinyURL:
https://www.educative.io/collection/page/5668639101419520/5649050225344512/5668600916475904/
We would get:
http://tinyurl.com/jlg8zpc
The shortened URL is nearly one-third the size of the actual URL.
URL shortening is used for optimizing links across devices, tracking individual links to analyze audience and campaign performance, and hiding affiliated original URLs.
If you haven’t used tinyurl.com before, please try creating a new shortened URL and spend some time going through the various options their service offers. This will help you a lot in understanding this chapter.
2. Requirements and Goals of the System
You should always clarify requirements at the beginning of the interview. Be sure to ask questions to find the exact scope of the system that the interviewer has in mind.
Our URL shortening system should meet the following requirements:
Functional Requirements:
1. Given a URL, our service should generate a shorter and unique alias of it. This is called a short link. This link should be short enough to be easily copied and pasted into applications.
2. When users access a short link, our service should redirect them to the original link.
3. Users should optionally be able to pick a custom short link for their URL.
4. Links will expire after a standard default timespan. Users should be able to specify the expiration time.
Non-Functional Requirements:
1. The system should be highly available. This is required because, if our service is down, all the URL redirections will start failing.
2. URL redirection should happen in real-time with minimal latency.
3. Shortened links should not be guessable (not predictable).
Extended Requirements:
1. Analytics; e.g., how many times a redirection happened?
2. Our service should also be accessible through REST APIs by other services.
3. Capacity Estimation and Constraints
Our system will be read-heavy. There will be lots of redirection requests compared to new URL shortenings. Let’s assume a 100:1 ratio between reads and writes.
Traffic estimates: Assuming we will have 500M new URL shortenings per month, with a 100:1 read/write ratio, we can expect 50B redirections during the same period:
100 * 500M => 50B
What would be the Queries Per Second (QPS) for our system? New URL shortenings per second:
500 million / (30 days * 24 hours * 3600 seconds) = ~200 URLs/s
Considering the 100:1 read/write ratio, URL redirections per second will be:
100 * 200 URLs/s = 20K/s
Storage estimates: Let’s assume we store every URL shortening request (and associated shortened link) for 5 years. Since we expect to have 500M new URLs every month, the total number of objects we expect to store will be 30 billion:
500 million * 5 years * 12 months = 30 billion
Let’s assume that each stored object will be approximately 500 bytes (just a ballpark estimate – we will dig into it later). We will need 15TB of total storage:
30 billion * 500 bytes = 15 TB
Bandwidth estimates: For write requests, since we expect 200 new URLs every second, total incoming data for our service will be 100KB per second:
200 * 500 bytes = 100 KB/s
For read requests, since every second we expect ~20K URL redirections, total outgoing data for our service would be 10MB per second:
20K * 500 bytes = ~10 MB/s
Memory estimates: If we want to cache some of the hot URLs that are frequently accessed, how much memory will we need to store them? If we follow the 80-20 rule, meaning 20% of URLs generate 80% of traffic, we would like to cache these 20% hot URLs.
Since we have 20K requests per second, we will be getting 1.7 billion requests per day:
20K * 3600 seconds * 24 hours = ~1.7 billion
To cache 20% of these requests, we will need 170GB of memory:
0.2 * 1.7 billion * 500 bytes = ~170GB
One thing to note here is that since there will be a lot of duplicate requests (for the same URL), our actual memory usage will be less than 170GB.
High level estimates: Assuming 500 million new URLs per month and a 100:1 read:write ratio, following is a summary of the high level estimates for our service:
New URLs: 200/s
URL redirections: 20K/s
Incoming data: 100KB/s
Outgoing data: 10MB/s
Storage for 5 years: 15TB
Memory for cache: 170GB
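The numbers above are straightforward to reproduce. Here is a small Python sketch of the same back-of-the-envelope arithmetic, using only the inputs stated in the text (500M new URLs per month, a 100:1 read:write ratio, 500 bytes per object, and 5 years of retention).

NEW_URLS_PER_MONTH = 500_000_000
READ_WRITE_RATIO = 100
OBJECT_SIZE_BYTES = 500
RETENTION_YEARS = 5

writes_per_sec = NEW_URLS_PER_MONTH / (30 * 24 * 3600)                # ~200 URLs/s
reads_per_sec = writes_per_sec * READ_WRITE_RATIO                     # ~20K/s
total_objects = NEW_URLS_PER_MONTH * 12 * RETENTION_YEARS             # 30 billion
storage_tb = total_objects * OBJECT_SIZE_BYTES / 1e12                 # ~15 TB
incoming_kb_per_sec = writes_per_sec * OBJECT_SIZE_BYTES / 1e3        # ~100 KB/s
outgoing_mb_per_sec = reads_per_sec * OBJECT_SIZE_BYTES / 1e6         # ~10 MB/s
cache_gb = 0.2 * reads_per_sec * 24 * 3600 * OBJECT_SIZE_BYTES / 1e9  # ~170 GB

print(f"{writes_per_sec:.0f} writes/s, {reads_per_sec:.0f} reads/s, "
      f"{storage_tb:.1f} TB storage, {cache_gb:.0f} GB cache")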
4. System APIs
Once we've finalized the requirements, it's always a good idea to define the system APIs. This should explicitly state what is expected from the system.
We can have SOAP or REST APIs to expose the functionality of our service. Following could be the definitions of the APIs for creating and deleting URLs:
createURL(api_dev_key, original_url, custom_alias=None, user_name=None, expire_date=None)
Parameters:
api_dev_key (string): The API developer key of a registered account. This will be used to, among other things, throttle users based on their allocated quota.
original_url (string): Original URL to be shortened.
custom_alias (string): Optional custom key for the URL.
user_name (string): Optional user name to be used in encoding.
expire_date (string): Optional expiration date for the shortened URL.
Returns: (string)
A successful insertion returns the shortened URL; otherwise, it returns an error code.
deleteURL(api_dev_key, url_key)
Where “url_key” is a string representing the shortened URL to be deleted. A successful deletion returns ‘URL Removed’.
How do we detect and prevent abuse? A malicious user can put us out of business by consuming all URL keys in the current design. To prevent abuse, we can limit users via their api_dev_key. Each api_dev_key can be limited to a certain number of URL creations and redirections per some time period (which may be set to a different duration per developer key).
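As an illustration of that throttling, here is a minimal fixed-window rate limiter keyed on api_dev_key; the window length and limit are illustrative values (not part of the original design), and a production deployment would keep these counters in a shared store rather than in process memory.

import time
from collections import defaultdict

WINDOW_SECONDS = 3600            # assumed window length
MAX_CREATIONS_PER_WINDOW = 100   # assumed quota; could differ per developer key

_counters = defaultdict(lambda: [0, 0.0])   # api_dev_key -> [count, window_start]

def allow_creation(api_dev_key):
    count, window_start = _counters[api_dev_key]
    now = time.time()
    if now - window_start >= WINDOW_SECONDS:
        _counters[api_dev_key] = [1, now]    # start a fresh window
        return True
    if count < MAX_CREATIONS_PER_WINDOW:
        _counters[api_dev_key][0] += 1       # consume one unit of quota
        return True
    return False                             # quota exhausted; reject the request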
5. Database Design
Defining the DB schema in the early stages of the interview would help to understand the data flow among various components and later would guide towards data partitioning.
A few observations about the nature of the data we will store:
1. We need to store billions of records.
2. Each object we store is small (less than 1K).
3. There are no relationships between records—other than storing which user created a URL.
4. Our service is read-heavy.
Database Schema:
We would need two tables: one for storing information about the URL mappings, and one for the data of the user who created the short link.
What kind of database should we use? Since we anticipate storing billions of rows, and we don’t need to use relationships between objects, a NoSQL key-value store like DynamoDB, Cassandra or Riak is a better choice. A NoSQL choice would also be easier to scale. Please see SQL vs NoSQL for more details.
6. Basic System Design and Algorithm
The problem we are solving here is how to generate a short and unique key for a given URL.
In the TinyURL example in Section 1, the shortened URL is “http://tinyurl.com/jlg8zpc”. The last six characters of this URL are the short key we want to generate. We’ll explore two solutions here:
a. Encoding actual URL
We can compute a unique hash (e.g., MD5 or SHA256, etc.) of the given URL. The hash can then be encoded for displaying. This encoding could be base36 ([a-z, 0-9]) or base62 ([A-Z, a-z, 0-9]), and if we add ‘-’ and ‘.’ we can use base64 encoding. A reasonable question would be, what should be the length of the short key? 6, 8, or 10 characters?
Using base64 encoding, a 6 letter long key would result in 64^6 = ~68.7 billion possible strings.
Using base64 encoding, an 8 letter long key would result in 64^8 = ~281 trillion possible strings.
With 68.7B unique strings, let’s assume six letter keys would suffice for our system.
If we use the MD5 algorithm as our hash function, it will produce a 128-bit hash value. After base64 encoding, we’ll get a string having more than 21 characters (since each base64 character encodes 6 bits of the hash value). Since we only have space for 8 characters per short key, how will we choose our key then? We can take the first 6 (or 8) letters for the key. This could result in key duplication, though, upon which we can choose some other characters out of the encoding string or swap some characters.
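Here is a minimal Python sketch of this approach; it uses the standard library’s URL-safe base64 alphabet (‘-’ and ‘_’) as a stand-in for the base64 variant described above, so the exact alphabet is an assumption.

import base64
import hashlib

def short_key(original_url, length=6):
    digest = hashlib.md5(original_url.encode("utf-8")).digest()   # 128-bit MD5 hash
    encoded = base64.urlsafe_b64encode(digest).decode("ascii")    # ~22 meaningful characters
    return encoded[:length]                                       # keep only the first 6 (or 8)

print(short_key("http://www.educative.io/distributed.php?id=design"))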
What are the different issues with our solution? We have the following couple of problems with our encoding scheme:
1. If multiple users enter the same URL, they can get the same shortened URL, which is not acceptable.
2. What if parts of the URL are URL-encoded? e.g., http://www.educative.io/distributed.php?id=design and http://www.educative.io/distributed.php%3Fid%3Ddesign are identical except for the URL encoding.
Workaround for the issues: We can append an increasing sequence number to each input URL to make it unique and then generate a hash of it. We don’t need to store this sequence number in the database, though. A possible problem with this approach is the ever-increasing sequence number: can it overflow? Appending an increasing sequence number will also impact the performance of the service.
Another solution could be to append the user id (which should be unique) to the input URL. However, if the user has not signed in, we would have to ask the user to choose a uniqueness key. Even after this, if we have a conflict, we have to keep generating a key until we get a unique one.
Request flow for shortening of a URL
b. Generating keys offline
We can have a standalone Key Generation Service (KGS) that generates random six-letter strings beforehand and stores them in a database (let’s call it key-DB). Whenever we want to shorten a URL, we will just take one of the already-generated keys and use it. This approach will make things quite simple and fast. Not only are we not encoding the URL, but we won’t have to worry about duplications or collisions. KGS will make sure all the keys inserted into key-DB are unique.
Can concurrency cause problems? As soon as a key is used, it should be marked in the database to ensure it doesn’t get used again. If there are multiple servers reading keys concurrently, we might get a scenario where two or more servers try to read the same key from the database. How can we solve this concurrency problem?
Servers can use KGS to read/mark keys in the database. KGS can use two tables to store keys: one for keys that are not used yet, and one for all the used keys. As soon as KGS gives keys to one of the servers, it can move them to the used keys table. KGS can always keep some keys in memory so that it can quickly provide them whenever a server needs them.
For simplicity, as soon as KGS loads some keys in memory, it can move them to the used keys table. This ensures each server gets unique keys. If KGS dies before assigning all the loaded keys to some server, we will be wasting those keys – which is acceptable, given the huge number of keys we have.
KGS also has to make sure not to give the same key to multiple servers. For that, it must synchronize (or get a lock on) the data structure holding the keys before removing keys from it and giving them to a server.
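A minimal in-memory sketch of that locking behavior is shown below; the two Python collections stand in for the unused/used key tables, and the class and method names are assumptions for illustration.

import threading

class KeyGenerationService:
    def __init__(self, pregenerated_keys):
        self._unused = list(pregenerated_keys)   # stands in for the "unused keys" table
        self._used = set()                       # stands in for the "used keys" table
        self._lock = threading.Lock()            # serializes concurrent app-server requests

    def get_keys(self, count=1):
        with self._lock:                         # no two servers can receive the same key
            batch = [self._unused.pop() for _ in range(min(count, len(self._unused)))]
            self._used.update(batch)             # mark keys as used as soon as they are handed out
            return batch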
What would be the key-DB size? With base64 encoding, we can generate 68.7B unique six-letter keys. If we need one byte to store one alpha-numeric character, we can store all these keys in:
6 (characters per key) * 68.7B (unique keys) = 412 GB.
Isn’t KGS a single point of failure? Yes, it is. To solve this, we can have a standby replica of KGS. Whenever the primary server dies, the standby server can take over to generate and provide keys.
32612
327Can each app server cache some keys from key-DB? Yes, this can surely speed things up.
328Although in this case, if the application server dies before consuming all the keys, we will end
329up losing those keys. This can be acceptable since we have 68B unique six letter keys.
330How would we perform a key lookup? We can look up the key in our database or key-value
331store to get the full URL. If it’s present, issue an “HTTP 302 Redirect” status back to the
332browser, passing the stored URL in the “Location” field of the request. If that key is not
333present in our system, issue an “HTTP 404 Not Found” status or redirect the user back to the
334homepage.
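As a sketch of that lookup path, the handler below assumes a Flask application and an in-memory dictionary standing in for the key-value store; both are illustrative choices, not part of the original design.

from flask import Flask, abort, redirect

app = Flask(__name__)
URL_STORE = {"jlg8zpc": "https://www.educative.io/"}   # stand-in for the datastore

@app.route("/<key>")
def resolve(key):
    original_url = URL_STORE.get(key)
    if original_url is None:
        abort(404)                           # "HTTP 404 Not Found"
    return redirect(original_url, code=302)  # "HTTP 302 Redirect"; URL goes in the Location header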
Should we impose size limits on custom aliases? Our service supports custom aliases. Users can pick any ‘key’ they like, but providing a custom alias is not mandatory. However, it is reasonable (and often desirable) to impose a size limit on a custom alias to ensure we have a consistent URL database. Let’s assume users can specify a maximum of 16 characters per custom key (as reflected in the above database schema).
High level system design for URL shortening
7. Data Partitioning and Replication
To scale out our DB, we need to partition it so that it can store information about billions of URLs. We need to come up with a partitioning scheme that would divide and store our data on different DB servers.
a. Range Based Partitioning: We can store URLs in separate partitions based on the first letter of the hash key. Hence we save all the URLs starting with the letter ‘A’ in one partition, save those that start with the letter ‘B’ in another partition, and so on. This approach is called range-based partitioning. We can even combine certain less frequently occurring letters into one database partition. We should come up with a static partitioning scheme so that we can always store/find a URL in a predictable manner.
The main problem with this approach is that it can lead to unbalanced DB servers. For example, we might decide to put all URLs starting with the letter ‘E’ into a DB partition, but later realize that we have too many URLs that start with the letter ‘E’.
b. Hash-Based Partitioning: In this scheme, we take a hash of the object we are storing. We then calculate which partition to use based upon the hash. In our case, we can take the hash of the ‘key’ or the short link to determine the partition in which we store the data object.
Our hashing function will randomly distribute URLs into different partitions (e.g., our hashing function can always map any ‘key’ to a number between [1…256]), and this number would represent the partition in which we store our object.
This approach can still lead to overloaded partitions, which can be solved by using Consistent Hashing.
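A minimal sketch of this scheme, assuming 256 partitions as in the example above and a stable (non-salted) hash function:

import hashlib

NUM_PARTITIONS = 256

def partition_for(short_key):
    digest = hashlib.sha1(short_key.encode("utf-8")).digest()   # stable across processes,
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS   # unlike Python's built-in hash()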
8. Cache
We can cache URLs that are frequently accessed. We can use an off-the-shelf solution like Memcache, which can store full URLs with their respective hashes. The application servers, before hitting backend storage, can quickly check if the cache has the desired URL.
How much cache should we have? We can start with 20% of daily traffic and, based on clients’ usage patterns, adjust how many cache servers we need. As estimated above, we need 170GB of memory to cache 20% of daily traffic. Since a modern-day server can have 256GB of memory, we can easily fit all the cache into one machine. Alternatively, we can use a couple of smaller servers to store all these hot URLs.
Which cache eviction policy would best fit our needs? When the cache is full, and we want to replace a link with a newer/hotter URL, how would we choose? Least Recently Used (LRU) can be a reasonable policy for our system. Under this policy, we discard the least recently used URL first. We can use a Linked Hash Map or a similar data structure to store our URLs and hashes, which will also keep track of the URLs that have been accessed recently.
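A minimal LRU sketch using Python’s OrderedDict as a stand-in for the Linked Hash Map mentioned above; the capacity is an illustrative number.

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._data = OrderedDict()              # keeps keys in access order

    def get(self, short_key):
        if short_key not in self._data:
            return None                         # cache miss; caller falls back to the database
        self._data.move_to_end(short_key)       # mark as most recently used
        return self._data[short_key]

    def put(self, short_key, original_url):
        self._data[short_key] = original_url
        self._data.move_to_end(short_key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)      # evict the least recently used entry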
To further increase the efficiency, we can replicate our caching servers to distribute the load between them.
How can each cache replica be updated? Whenever there is a cache miss, our servers would be hitting the backend database. Whenever this happens, we can update the cache and pass the new entry to all the cache replicas. Each replica can update its cache by adding the new entry. If a replica already has that entry, it can simply ignore it.
Request flow for accessing a shortened URL
9. Load Balancer (LB)
We can add a load balancing layer at three places in our system:
1. Between Clients and Application servers
2. Between Application Servers and Database servers
3. Between Application Servers and Cache servers
Initially, we could use a simple Round Robin approach that distributes incoming requests equally among backend servers. This LB is simple to implement and does not introduce any overhead. Another benefit of this approach is that if a server is dead, the LB will take it out of the rotation and will stop sending any traffic to it.
A problem with Round Robin LB is that server load is not taken into consideration. If a server is overloaded or slow, the LB will not stop sending new requests to that server. To handle this, a more intelligent LB solution can be placed that periodically queries the backend server about its load and adjusts traffic based on that.
10. Purging or DB cleanup
Should entries stick around forever, or should they be purged? If a user-specified expiration time is reached, what should happen to the link?
If we chose to actively search for expired links to remove them, it would put a lot of pressure on our database. Instead, we can slowly remove expired links and do a lazy cleanup. Our service will make sure that only expired links will be deleted; although some expired links can live longer, they will never be returned to users.
• Whenever a user tries to access an expired link, we can delete the link and return an error to the user.
• A separate Cleanup service can run periodically to remove expired links from our storage and cache. This service should be very lightweight and can be scheduled to run only when the user traffic is expected to be low.
• We can have a default expiration time for each link (e.g., two years).
• After removing an expired link, we can put the key back in the key-DB to be reused.
• Should we remove links that haven’t been visited in some length of time, say six months? This could be tricky. Since storage is getting cheap, we can decide to keep links forever.
Detailed component design for URL shortening
11. Telemetry
How many times has a short URL been used? What were the user locations? How would we store these statistics? If it is part of a DB row that gets updated on each view, what will happen when a popular URL is slammed with a large number of concurrent requests?
Some statistics worth tracking: country of the visitor, date and time of access, web page that referred the click, browser, or platform from where the page was accessed.
12. Security and Permissions
Can users create private URLs or allow a particular set of users to access a URL?
We can store the permission level (public/private) with each URL in the database. We can also create a separate table to store UserIDs that have permission to see a specific URL. If a user does not have permission and tries to access a URL, we can send an error (HTTP 401) back. Given that we are storing our data in a NoSQL wide-column database like Cassandra, the key for the table storing permissions would be the ‘Hash’ (or the KGS generated ‘key’). The columns will store the UserIDs of those users that have permission to see the URL.
Designing Pastebin
Let's design a Pastebin-like web service, where users can store plain text. Users of the service will enter a piece of text and get a randomly generated URL to access it. Similar Services: pastebin.com, pasted.co, chopapp.com Difficulty Level: Easy
1. What is Pastebin?
Pastebin-like services enable users to store plain text or images over the network (typically the Internet) and generate unique URLs to access the uploaded data. Such services are also used to share data over the network quickly, as users would just need to pass the URL to let other users see it.
If you haven’t used pastebin.com before, please try creating a new ‘Paste’ there and spend some time going through the different options their service offers. This will help you a lot in understanding this chapter.
2. Requirements and Goals of the System
Our Pastebin service should meet the following requirements:
Functional Requirements:
1. Users should be able to upload or “paste” their data and get a unique URL to access it.
2. Users will only be able to upload text.
3. Data and links will expire after a specific timespan automatically; users should also be able to specify the expiration time.
4. Users should optionally be able to pick a custom alias for their paste.
Non-Functional Requirements:
1. The system should be highly reliable; any data uploaded should not be lost.
2. The system should be highly available. This is required because if our service is down, users will not be able to access their Pastes.
3. Users should be able to access their Pastes in real-time with minimum latency.
4. Paste links should not be guessable (not predictable).
Extended Requirements:
1. Analytics, e.g., how many times a paste was accessed?
2. Our service should also be accessible through REST APIs by other services.
3. Some Design Considerations
Pastebin shares some requirements with the URL Shortening service, but there are some additional design considerations we should keep in mind.
What should be the limit on the amount of text a user can paste at a time? We can limit users to Pastes no bigger than 10MB to stop abuse of the service.
Should we impose size limits on custom URLs? Since our service supports custom URLs, users can pick any URL that they like, but providing a custom URL is not mandatory. However, it is reasonable (and often desirable) to impose a size limit on custom URLs so that we have a consistent URL database.
4. Capacity Estimation and Constraints
Our service will be read-heavy; there will be more read requests compared to new Paste creations. We can assume a 5:1 ratio between reads and writes.
Traffic estimates: Pastebin services are not expected to have traffic similar to Twitter or Facebook; let’s assume here that we get one million new pastes added to our system every day. This leaves us with five million reads per day.
New Pastes per second:
1M / (24 hours * 3600 seconds) ~= 12 pastes/sec
Paste reads per second:
5M / (24 hours * 3600 seconds) ~= 58 reads/sec
Storage estimates: Users can upload a maximum of 10MB of data; commonly, Pastebin-like services are used to share source code, configs, or logs. Such texts are not huge, so let’s assume that each paste on average contains 10KB.
At this rate, we will be storing 10GB of data per day.
1M * 10KB => 10 GB/day
If we want to store this data for ten years, we would need a total storage capacity of 36TB.
With 1M pastes every day, we will have 3.6 billion Pastes in 10 years. We need to generate and store keys to uniquely identify these pastes. If we use base64 encoding ([A-Z, a-z, 0-9, ., -]), we would need six-letter strings:
64^6 ~= 68.7 billion unique strings
If it takes one byte to store one character, the total size required to store 3.6B keys would be:
3.6B * 6 => 22 GB
22GB is negligible compared to 36TB. To keep some margin, we will assume a 70% capacity model (meaning we don’t want to use more than 70% of our total storage capacity at any point), which raises our storage needs to 51.4TB.
Bandwidth estimates: For write requests, we expect 12 new pastes per second, resulting in 120KB of ingress per second.
12 * 10KB => 120 KB/s
As for the read requests, we expect 58 requests per second. Therefore, total data egress (sent to users) will be 0.6 MB/s.
58 * 10KB => 0.6 MB/s
Although total ingress and egress are not big, we should keep these numbers in mind while designing our service.
Memory estimates: We can cache some of the hot pastes that are frequently accessed. Following the 80-20 rule, meaning 20% of hot pastes generate 80% of traffic, we would like to cache these 20% of pastes.
Since we have 5M read requests per day, to cache 20% of these requests, we would need:
0.2 * 5M * 10KB ~= 10 GB
5. System APIs
We can have SOAP or REST APIs to expose the functionality of our service. Following could be the definitions of the APIs to create/retrieve/delete Pastes:
addPaste(api_dev_key, paste_data, custom_url=None, user_name=None, paste_name=None, expire_date=None)
Parameters:
api_dev_key (string): The API developer key of a registered account. This will be used to, among other things, throttle users based on their allocated quota.
paste_data (string): Textual data of the paste.
custom_url (string): Optional custom URL.
user_name (string): Optional user name to be used to generate the URL.
paste_name (string): Optional name of the paste.
expire_date (string): Optional expiration date for the paste.
Returns: (string)
A successful insertion returns the URL through which the paste can be accessed; otherwise, it will return an error code.
Similarly, we can have retrieve and delete Paste APIs:
getPaste(api_dev_key, api_paste_key)
Where “api_paste_key” is a string representing the Paste Key of the paste to be retrieved. This API will return the textual data of the paste.
deletePaste(api_dev_key, api_paste_key)
A successful deletion returns ‘true’, otherwise returns ‘false’.
6. Database Design
A few observations about the nature of the data we are storing:
1. We need to store billions of records.
2. Each metadata object we are storing would be small (less than 1KB).
3. Each paste object we are storing can be of medium size (it can be a few MB).
4. There are no relationships between records, except if we want to store which user created what Paste.
5. Our service is read-heavy.
Database Schema:
We would need two tables, one for storing information about the Pastes and the other for users’ data.
Here, ‘URLHash’ is the URL equivalent of the TinyURL, and ‘ContentKey’ is a reference to an external object storing the contents of the paste; we’ll discuss the external storage of the paste contents later in the chapter.
7. High Level Design
At a high level, we need an application layer that will serve all the read and write requests. The application layer will talk to a storage layer to store and retrieve data. We can segregate our storage layer with one database storing metadata related to each paste, users, etc., while the other stores the paste contents in some object storage (like Amazon S3). This division of data will also allow us to scale them individually.
8. Component Design
a. Application layer
Our application layer will process all incoming and outgoing requests. The application servers will be talking to the backend data store components to serve the requests.
How to handle a write request? Upon receiving a write request, our application server will generate a six-letter random string, which will serve as the key of the paste (if the user has not provided a custom key). The application server will then store the contents of the paste and the generated key in the database. After a successful insertion, the server can return the key to the user. One possible problem here could be that the insertion fails because of a duplicate key. Since we are generating a random key, there is a possibility that the newly generated key could match an existing one. In that case, we should regenerate a new key and try again. We should keep retrying until we don’t see a failure due to a duplicate key. We should return an error to the user if the custom key they have provided is already present in our database.
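A minimal sketch of that write path with retry on duplicate keys is shown below; the dictionary-backed db_insert helper is a stand-in for the real datastore insert and is assumed purely for illustration.

import random
import string

ALPHABET = string.ascii_letters + string.digits
_STORE = {}

def db_insert(key, value):
    # Stand-in for the datastore insert; returns False if the key already exists.
    if key in _STORE:
        return False
    _STORE[key] = value
    return True

def create_paste(paste_data, custom_key=None):
    if custom_key is not None:
        if not db_insert(custom_key, paste_data):
            raise ValueError("Custom key already in use")   # report the duplicate to the user
        return custom_key
    while True:                                             # retry until a random key is free
        key = "".join(random.choice(ALPHABET) for _ in range(6))
        if db_insert(key, paste_data):
            return key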
Another solution to the above problem could be to run a standalone Key Generation Service (KGS) that generates random six-letter strings beforehand and stores them in a database (let’s call it key-DB). Whenever we want to store a new paste, we will just take one of the already-generated keys and use it. This approach will make things quite simple and fast since we will not be worrying about duplications or collisions. KGS will make sure all the keys inserted in key-DB are unique. KGS can use two tables to store keys: one for keys that are not used yet and one for all the used keys. As soon as KGS gives some keys to an application server, it can move them to the used keys table. KGS can always keep some keys in memory so that whenever a server needs them, it can quickly provide them. As soon as KGS loads some keys in memory, it can move them to the used keys table; this way we can make sure each server gets unique keys. If KGS dies before using all the keys loaded in memory, we will be wasting those keys. We can ignore these keys given that we have a huge number of them.
Isn’t KGS a single point of failure? Yes, it is. To solve this, we can have a standby replica of KGS, and whenever the primary server dies, it can take over to generate and provide keys.
Can each app server cache some keys from key-DB? Yes, this can surely speed things up. Although in this case, if the application server dies before consuming all the keys, we will end up losing those keys. This could be acceptable since we have 68B unique six-letter keys, which are a lot more than we require.
How does it handle a paste read request? Upon receiving a read paste request, the application service layer contacts the datastore. The datastore searches for the key and, if it is found, returns the paste’s contents. Otherwise, an error code is returned.
b. Datastore layer
We can divide our datastore layer into two:
1. Metadata database: We can use a relational database like MySQL or a distributed key-value store like Dynamo or Cassandra.
2. Object storage: We can store our contents in an object storage like Amazon’s S3. Whenever we feel like we are hitting our full capacity on content storage, we can easily increase it by adding more servers.
Detailed component design for Pastebin
9. Purging or DB Cleanup
Please see Designing a URL Shortening service.
10. Data Partitioning and Replication
Please see Designing a URL Shortening service.
11. Cache and Load Balancer
Please see Designing a URL Shortening service.
12. Security and Permissions
Please see Designing a URL Shortening service.
Designing Instagram
Let's design a photo-sharing service like Instagram, where users can upload photos to share them with other users. Similar Services: Flickr, Picasa Difficulty Level: Medium
1. What is Instagram?
Instagram is a social networking service which enables its users to upload and share their photos and videos with other users. Instagram users can choose to share information either publicly or privately. Anything shared publicly can be seen by any other user, whereas privately shared content can only be accessed by a specified set of people. Instagram also enables its users to share through many other social networking platforms, such as Facebook, Twitter, Flickr, and Tumblr.
For the sake of this exercise, we plan to design a simpler version of Instagram, where a user can share photos and can also follow other users. The ‘News Feed’ for each user will consist of top photos of all the people the user follows.
2. Requirements and Goals of the System
We’ll focus on the following set of requirements while designing Instagram:
Functional Requirements
1. Users should be able to upload/download/view photos.
2. Users can perform searches based on photo/video titles.
3. Users can follow other users.
4. The system should be able to generate and display a user’s News Feed consisting of top photos from all the people the user follows.
Non-functional Requirements
1. Our service needs to be highly available.
2. The acceptable latency of the system is 200ms for News Feed generation.
3. Consistency can take a hit (in the interest of availability); if a user doesn’t see a photo for a while, it should be fine.
4. The system should be highly reliable; any uploaded photo or video should never be lost.
Not in scope: Adding tags to photos, searching photos on tags, commenting on photos, tagging users to photos, who to follow, etc.
3. Some Design Considerations
The system would be read-heavy, so we will focus on building a system that can retrieve photos quickly.
1. Practically, users can upload as many photos as they like. Efficient management of storage should be a crucial factor while designing this system.
2. Low latency is expected while viewing photos.
3. Data should be 100% reliable. If a user uploads a photo, the system will guarantee that it will never be lost.
4. Capacity Estimation and Constraints
• Let’s assume we have 500M total users, with 1M daily active users.
• 2M new photos every day, 23 new photos every second.
• Average photo file size => 200KB
• Total space required for 1 day of photos:
2M * 200KB => 400 GB
• Total space required for 10 years:
400GB * 365 (days a year) * 10 (years) ~= 1425TB
5. High Level System Design
At a high level, we need to support two scenarios, one to upload photos and the other to view/search photos. Our service would need some object storage servers to store photos and also some database servers to store metadata information about the photos.
6. Database Schema
Defining the DB schema in the early stages of the interview would help to understand the data flow among various components and later would guide towards data partitioning.
We need to store data about users, their uploaded photos, and the people they follow. The Photo table will store all data related to a photo; we need to have an index on (PhotoID, CreationDate) since we need to fetch recent photos first.
A straightforward approach for storing the above schema would be to use an RDBMS like MySQL since we require joins. But relational databases come with their challenges, especially when we need to scale them. For details, please take a look at SQL vs. NoSQL.
We can store photos in a distributed file storage like HDFS or S3.
We can store the above schema in a distributed key-value store to enjoy the benefits offered by NoSQL. All the metadata related to photos can go to a table where the ‘key’ would be the ‘PhotoID’ and the ‘value’ would be an object containing PhotoLocation, UserLocation, CreationTimestamp, etc.
We need to store relationships between users and photos, to know who owns which photo. We also need to store the list of people a user follows. For both of these tables, we can use a wide-column datastore like Cassandra. For the ‘UserPhoto’ table, the ‘key’ would be ‘UserID’ and the ‘value’ would be the list of ‘PhotoIDs’ the user owns, stored in different columns. We will have a similar scheme for the ‘UserFollow’ table.
Cassandra, or key-value stores in general, always maintain a certain number of replicas to offer reliability. Also, in such data stores, deletes don’t get applied instantly; data is retained for a certain number of days (to support undeleting) before getting removed from the system permanently.
7. Data Size Estimation
Let’s estimate how much data will be going into each table and how much total storage we will need for 10 years.
User: Assuming each “int” and “dateTime” is four bytes, each row in the User table will be of 68 bytes:
UserID (4 bytes) + Name (20 bytes) + Email (32 bytes) + DateOfBirth (4 bytes) + CreationDate (4 bytes) + LastLogin (4 bytes) = 68 bytes
If we have 500 million users, we will need 32GB of total storage.
500 million * 68 ~= 32GB
Photo: Each row in the Photo table will be of 284 bytes:
PhotoID (4 bytes) + UserID (4 bytes) + PhotoPath (256 bytes) + PhotoLatitude (4 bytes) + PhotoLongitude (4 bytes) + UserLatitude (4 bytes) + UserLongitude (4 bytes) + CreationDate (4 bytes) = 284 bytes
If 2M new photos get uploaded every day, we will need 0.5GB of storage for one day:
2M * 284 bytes ~= 0.5GB per day
For 10 years we will need 1.88TB of storage.
UserFollow: Each row in the UserFollow table will consist of 8 bytes. If we have 500 million users and on average each user follows 500 users, we would need 1.82TB of storage for the UserFollow table:
500 million users * 500 followers * 8 bytes ~= 1.82TB
Total space required for all tables for 10 years will be 3.7TB:
32GB + 1.88TB + 1.82TB ~= 3.7TB
8. Component Design
Photo uploads (or writes) can be slow as they have to go to the disk, whereas reads will be faster, especially if they are being served from cache.
Uploading users can consume all the available connections, as uploading is a slow process. This means that ‘reads’ cannot be served if the system gets busy with all the write requests. Before designing our system, we should keep in mind that web servers have a connection limit. If we assume that a web server can have a maximum of 500 connections at any time, then it can’t have more than 500 concurrent uploads or reads. To handle this bottleneck we can split reads and writes into separate services. We will have dedicated servers for reads and different servers for writes to ensure that uploads don’t hog the system.
Separating photos’ read and write requests will also allow us to scale and optimize each of these operations independently.
9. Reliability and Redundancy
Losing files is not an option for our service. Therefore, we will store multiple copies of each file so that if one storage server dies we can retrieve the photo from another copy present on a different storage server.
This same principle also applies to other components of the system. If we want to have high availability of the system, we need to have multiple replicas of services running in the system, so that if a few services die down the system still remains available and running. Redundancy removes single points of failure in the system.
If only one instance of a service is required to run at any point, we can run a redundant secondary copy of the service that is not serving any traffic but can take control after failover when the primary has a problem.
Creating redundancy in a system can remove single points of failure and provide a backup or spare functionality if needed in a crisis. For example, if there are two instances of the same service running in production and one fails or degrades, the system can fail over to the healthy copy. Failover can happen automatically or require manual intervention.
10. Data Sharding
Let’s discuss different schemes for metadata sharding:
a. Partitioning based on UserID: Let’s assume we shard based on the ‘UserID’ so that we can keep all photos of a user on the same shard. If one DB shard is 1TB, we will need four shards to store 3.7TB of data. Let’s assume, for better performance and scalability, we keep 10 shards.
So we’ll find the shard number by UserID % 10 and then store the data there. To uniquely identify any photo in our system, we can append the shard number to each PhotoID.
How can we generate PhotoIDs? Each DB shard can have its own auto-increment sequence for PhotoIDs, and since we will append ShardID to each PhotoID, it will make it unique throughout our system.
What are the different issues with this partitioning scheme?
1. How would we handle hot users? Several people follow such hot users, and a lot of other people see any photo they upload.
2. Some users will have a lot of photos compared to others, thus making a non-uniform distribution of storage.
3. What if we cannot store all pictures of a user on one shard? If we distribute photos of a user onto multiple shards, will it cause higher latencies?
4. Storing all photos of a user on one shard can cause issues like unavailability of all of the user’s data if that shard is down, or higher latency if it is serving high load, etc.
b. Partitioning based on PhotoID: If we can generate unique PhotoIDs first and then find a shard number through “PhotoID % 10”, the above problems will have been solved. We would not need to append ShardID to PhotoID in this case, as PhotoID will itself be unique throughout the system.
How can we generate PhotoIDs? Here we cannot have an auto-incrementing sequence in each shard to define PhotoID because we need to know the PhotoID first to find the shard where it will be stored. One solution could be that we dedicate a separate database instance to generate auto-incrementing IDs. If our PhotoID can fit into 64 bits, we can define a table containing only a 64-bit ID field. So whenever we would like to add a photo to our system, we can insert a new row in this table and take that ID to be the PhotoID of the new photo.
Wouldn’t this key-generating DB be a single point of failure? Yes, it would be. A workaround for that could be defining two such databases, with one generating even-numbered IDs and the other odd-numbered. For MySQL, the following configuration can define such sequences:
KeyGeneratingServer1:
auto-increment-increment = 2
auto-increment-offset = 1
KeyGeneratingServer2:
auto-increment-increment = 2
auto-increment-offset = 2
We can put a load balancer in front of both of these databases to round robin between them and to deal with downtime. Both these servers could be out of sync, with one generating more keys than the other, but this will not cause any issue in our system. We can extend this design by defining separate ID tables for Users, Photo-Comments, or other objects present in our system.
Alternatively, we can implement a ‘key’ generation scheme similar to what we have discussed in Designing a URL Shortening service like TinyURL.
How can we plan for the future growth of our system? We can have a large number of logical partitions to accommodate future data growth, such that in the beginning, multiple logical partitions reside on a single physical database server. Since each database server can have multiple database instances on it, we can have separate databases for each logical partition on any server. So whenever we feel that a particular database server has a lot of data, we can migrate some logical partitions from it to another server. We can maintain a config file (or a separate database) that can map our logical partitions to database servers; this will enable us to move partitions around easily. Whenever we want to move a partition, we only have to update the config file to announce the change.
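A minimal sketch of such a mapping, assuming 10 logical partitions and illustrative server names; in practice this table would live in a config file or a small config database that every application server reads.

PARTITION_MAP = {
    0: "db-server-1", 1: "db-server-1", 2: "db-server-1",
    3: "db-server-2", 4: "db-server-2", 5: "db-server-2",
    6: "db-server-3", 7: "db-server-3",
    8: "db-server-4", 9: "db-server-4",
}

def server_for(photo_id):
    logical_partition = photo_id % 10          # same "PhotoID % 10" rule as above
    return PARTITION_MAP[logical_partition]    # moving a partition only means editing this map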
11. Ranking and News Feed Generation
To create the News Feed for any given user, we need to fetch the latest, most popular, and relevant photos of the people the user follows.
For simplicity, let’s assume we need to fetch the top 100 photos for a user’s News Feed. Our application server will first get a list of people the user follows and then fetch metadata info of the latest 100 photos from each user. In the final step, the server will submit all these photos to our ranking algorithm, which will determine the top 100 photos (based on recency, likeness, etc.) and return them to the user. A possible problem with this approach would be higher latency, as we have to query multiple tables and perform sorting/merging/ranking on the results. To improve the efficiency, we can pre-generate the News Feed and store it in a separate table.
Pre-generating the News Feed: We can have dedicated servers that are continuously generating users’ News Feeds and storing them in a ‘UserNewsFeed’ table. So whenever any user needs the latest photos for their News Feed, we will simply query this table and return the results to the user.
Whenever these servers need to generate the News Feed of a user, they will first query the UserNewsFeed table to find the last time the News Feed was generated for that user. Then, new News Feed data will be generated from that time onwards (following the steps mentioned above).
What are the different approaches for sending News Feed contents to the users?
1. Pull: Clients can pull the News Feed contents from the server on a regular basis or manually whenever they need it. Possible problems with this approach are: a) new data might not be shown to the users until clients issue a pull request; b) most of the time, pull requests will result in an empty response if there is no new data.
2. Push: Servers can push new data to the users as soon as it is available. To efficiently manage this, users have to maintain a Long Poll request with the server for receiving the updates. A possible problem with this approach arises for a user who follows a lot of people, or for a celebrity user who has millions of followers; in these cases, the server has to push updates quite frequently.
3. Hybrid: We can adopt a hybrid approach. We can move all the users who follow a high number of people to a pull-based model and only push data to those users who follow a few hundred (or thousand) people. Another approach could be that the server pushes updates to all the users no more often than a certain frequency, letting users with a lot of follows/updates pull data regularly.
821For a detailed discussion about News Feed generation, take a look at Designing Facebook’s
822Newsfeed.
82312. News Feed Creation with Sharded Data
One of the most important requirements for creating the News Feed for any given user is to fetch the latest photos from all the people the user follows. For this, we need a mechanism to sort photos by their creation time. To do this efficiently, we can make the photo creation time part of the PhotoID. As we will have a primary index on PhotoID, it will be quite quick to find the latest PhotoIDs.
We can use epoch time for this. Let's say our PhotoID will have two parts; the first part will represent the epoch time and the second part will be an auto-incrementing sequence. So, to make a new PhotoID, we can take the current epoch time and append an auto-incrementing ID from our key-generating DB. We can figure out the shard number from this PhotoID (PhotoID % 10) and store the photo there.
What could be the size of our PhotoID? Let's say our epoch time starts today; how many bits would we need to store the number of seconds for the next 50 years?
86400 sec/day * 365 (days a year) * 50 (years) => 1.6 billion seconds
We would need 31 bits to store this number. Since, on average, we are expecting 23 new photos per second, we can allocate 9 bits to store an auto-incremented sequence. So every second we can store 2^9 (=> 512) new photos. We can reset our auto-incrementing sequence every second.
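Here is a minimal Python sketch of this scheme, assuming a custom epoch and a 9-bit per-second sequence; the local counter below is a simplified stand-in for the auto-incrementing ID that would come from our key-generating DB.

import time

CUSTOM_EPOCH = 1576368000      # assumed start of our epoch, in seconds (Dec 15, 2019)
SEQUENCE_BITS = 9              # 2^9 = 512 photos per second
NUM_SHARDS = 10

_last_second = -1
_sequence = 0

def new_photo_id() -> int:
    """PhotoID = (seconds since custom epoch) << 9 | per-second sequence (31 + 9 = 40 bits)."""
    global _last_second, _sequence
    now = int(time.time()) - CUSTOM_EPOCH
    if now != _last_second:                      # reset the sequence every second
        _last_second, _sequence = now, 0
    else:
        _sequence = (_sequence + 1) % (1 << SEQUENCE_BITS)
    return (now << SEQUENCE_BITS) | _sequence

pid = new_photo_id()
print("shard:", pid % NUM_SHARDS, "created at:", (pid >> SEQUENCE_BITS) + CUSTOM_EPOCH)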
841We will discuss more details about this technique under ‘Data Sharding’ in Designing Twitter.
84313. Cache and Load balancing
844Our service would need a massive-scale photo delivery system to serve the globally distributed
845users. Our service should push its content closer to the user using a large number of
846geographically distributed photo cache servers and use CDNs (for details see Caching).
We can introduce a cache for metadata servers to cache hot database rows. We can use Memcached to cache the data; application servers, before hitting the database, can quickly check if the cache has the desired rows. Least Recently Used (LRU) can be a reasonable cache eviction policy for our system. Under this policy, we discard the least recently viewed row first.
How can we build a more intelligent cache? If we go with the 80-20 rule, 20% of the daily read volume for photos generates 80% of the traffic, which means that certain photos are so popular that the majority of people view them. This suggests that we can try caching 20% of the daily read volume of photos and metadata.
856Designing Dropbox
857Let's design a file hosting service like Dropbox or Google Drive. Cloud file storage enables
858users to store their data on remote servers. Usually, these servers are maintained by cloud
859storage providers and made available to users over a network (typically through the
Internet). Users pay for their cloud data storage on a monthly basis.
Similar Services: OneDrive, Google Drive
Difficulty Level: Medium
8621. Why Cloud Storage?
Cloud file storage services have become very popular recently as they simplify the storage and exchange of digital resources among multiple devices. The shift from using a single personal computer to using multiple devices with different platforms and operating systems, such as smartphones and tablets, each accessible from various geographical locations at any time, is largely responsible for the huge popularity of cloud storage services. Following are some of the top benefits of such services:
869Availability: The motto of cloud storage services is to have data availability anywhere,
870anytime. Users can access their files/photos from any device whenever and wherever they
871like.
Reliability and Durability: Another benefit of cloud storage is that it offers very high reliability and durability of data. Cloud storage ensures that users will never lose their data by keeping multiple copies of it stored on geographically distributed servers.
875Scalability: Users will never have to worry about getting out of storage space. With cloud
876storage you have unlimited storage as long as you are ready to pay for it.
877If you haven’t used dropbox.com before, we would highly recommend creating an account
878there and uploading/editing a file and also going through the different options their service
879offers. This will help you a lot in understanding this chapter.
8802. Requirements and Goals of the System
You should always clarify requirements at the beginning of the interview. Be sure to ask questions to find the exact scope of the system that the interviewer has in mind.
884What do we wish to achieve from a Cloud Storage system? Here are the top-level
885requirements for our system:
8861. Users should be able to upload and download their files/photos from any device.
8872. Users should be able to share files or folders with other users.
8883. Our service should support automatic synchronization between devices, i.e., after
889updating a file on one device, it should get synchronized on all devices.
8914. The system should support storing large files up to a GB.
8925. ACID-ity is required. Atomicity, Consistency, Isolation and Durability of all file
893operations should be guaranteed.
8946. Our system should support offline editing. Users should be able to add/delete/modify
895files while offline, and as soon as they come online, all their changes should be synced
896to the remote servers and other online devices.
897Extended Requirements
898• The system should support snapshotting of the data, so that users can go back to any
899version of the files.
9003. Some Design Considerations
901• We should expect huge read and write volumes.
• The read-to-write ratio is expected to be nearly 1:1.
• Internally, files can be stored in small parts or chunks (say 4MB); this provides a lot of benefits, e.g., failed operations only need to be retried for smaller parts of a file. If a user fails to upload a file, then only the failing chunk will be retried.
906• We can reduce the amount of data exchange by transferring updated chunks only.
907• By removing duplicate chunks, we can save storage space and bandwidth usage.
908• Keeping a local copy of the metadata (file name, size, etc.) with the client can save us a
909lot of round trips to the server.
910• For small changes, clients can intelligently upload the diffs instead of the whole chunk.
9114. Capacity Estimation and Constraints
912• Let’s assume that we have 500M total users, and 100M daily active users (DAU).
913• Let’s assume that on average each user connects from three different devices.
914• On average if a user has 200 files/photos, we will have 100 billion total files.
915• Let’s assume that average file size is 100KB, this would give us ten petabytes of total
916storage.
917100B * 100KB => 10PB
918• Let’s also assume that we will have one million active connections per minute.
9195. High Level Design
920The user will specify a folder as the workspace on their device. Any file/photo/folder placed in
921this folder will be uploaded to the cloud, and whenever a file is modified or deleted, it will be
922reflected in the same way in the cloud storage. The user can specify similar workspaces on all
923their devices and any modification done on one device will be propagated to all other devices
924to have the same view of the workspace everywhere.
926At a high level, we need to store files and their metadata information like File Name, File Size,
927Directory, etc., and who this file is shared with. So, we need some servers that can help the
928clients to upload/download files to Cloud Storage and some servers that can facilitate
929updating metadata about files and users. We also need some mechanism to notify all clients
930whenever an update happens so they can synchronize their files.
931As shown in the diagram below, Block servers will work with the clients to upload/download
932files from cloud storage and Metadata servers will keep metadata of files updated in a SQL or
933NoSQL database. Synchronization servers will handle the workflow of notifying all clients
934about different changes for synchronization.
935High level design for Dropbox
9366. Component Design
937Let’s go through the major components of our system one by one:
938a. Client
939The Client Application monitors the workspace folder on the user’s machine and syncs all
940files/folders in it with the remote Cloud Storage. The client application will work with the
941storage servers to upload, download, and modify actual files to backend Cloud Storage. The
942client also interacts with the remote Synchronization Service to handle any file metadata
943updates, e.g., change in the file name, size, modification date, etc.
944Here are some of the essential operations for the client:
9451. Upload and download files.
9462. Detect file changes in the workspace folder.
3. Handle conflicts due to offline or concurrent updates.
How do we handle file transfer efficiently? As mentioned above, we can break each file into smaller chunks so that we transfer only those chunks that are modified and not the whole file. Let's say we divide each file into fixed-size 4MB chunks. We can statically calculate the optimal chunk size based on: 1) the storage devices we use in the cloud, to optimize space utilization and input/output operations per second (IOPS); 2) network bandwidth; 3) the average file size in the storage, etc. In our metadata, we should also keep a record of each file and the chunks that constitute it.
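As a rough illustration, a minimal chunker in Python could look like the sketch below: it splits a file into fixed-size 4MB chunks and records a SHA-256 hash per chunk, which is the kind of per-chunk record we would keep in the metadata to detect modified chunks later.

import hashlib

CHUNK_SIZE = 4 * 1024 * 1024   # the fixed 4MB chunk size assumed in this design

def chunk_file(path: str):
    """Yield (chunk_index, sha256_hex, chunk_bytes) for each 4MB chunk of the file."""
    with open(path, "rb") as f:
        index = 0
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            yield index, hashlib.sha256(data).hexdigest(), data
            index += 1

# Example: build the per-file chunk list we would keep in the metadata.
# manifest = [(i, digest) for i, digest, _ in chunk_file("report.pdf")]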
Should we keep a copy of metadata with the Client? Keeping a local copy of metadata not only enables us to do offline updates but also saves a lot of round trips to update remote metadata.
958How can clients efficiently listen to changes happening with other clients? One solution
959could be that the clients periodically check with the server if there are any changes. The
960problem with this approach is that we will have a delay in reflecting changes locally as clients
961will be checking for changes periodically compared to a server notifying whenever there is
962some change. If the client frequently checks the server for changes, it will not only be wasting
963bandwidth, as the server has to return an empty response most of the time, but will also be
964keeping the server busy. Pulling information in this manner is not scalable.
965A solution to the above problem could be to use HTTP long polling. With long polling the
966client requests information from the server with the expectation that the server may not
967respond immediately. If the server has no new data for the client when the poll is received,
968instead of sending an empty response, the server holds the request open and waits for
969response information to become available. Once it does have new information, the server
970immediately sends an HTTP/S response to the client, completing the open HTTP/S Request.
971Upon receipt of the server response, the client can immediately issue another server request
972for future updates.
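A minimal long-polling loop on the client side could look like the following sketch; the '/changes' endpoint, its parameters, and the JSON shape are assumptions for illustration only, and the 'requests' library is used just to keep the sketch short.

import requests   # third-party HTTP client, used here only to keep the sketch short

CHANGES_URL = "https://sync.example.com/changes"   # hypothetical endpoint
SERVER_HOLD_SECONDS = 60                           # how long the server may hold a request open

def apply_change_locally(change: dict) -> None:
    print("applying change:", change)              # placeholder for the Watcher/Indexer logic

def listen_for_changes(last_seen_version: int = 0) -> None:
    while True:
        try:
            resp = requests.get(
                CHANGES_URL,
                params={"since": last_seen_version},
                timeout=SERVER_HOLD_SECONDS + 5,   # a bit above the server's hold time
            )
        except requests.exceptions.Timeout:
            continue                               # nothing new arrived; issue the next poll
        if resp.ok:
            for change in resp.json().get("changes", []):
                apply_change_locally(change)
                last_seen_version = change.get("version", last_seen_version)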
973Based on the above considerations, we can divide our client into following four parts:
974I. Internal Metadata Database will keep track of all the files, chunks, their versions, and their
975location in the file system.
976II. Chunker will split the files into smaller pieces called chunks. It will also be responsible for
977reconstructing a file from its chunks. Our chunking algorithm will detect the parts of the files
978that have been modified by the user and only transfer those parts to the Cloud Storage; this
979will save us bandwidth and synchronization time.
980III. Watcher will monitor the local workspace folders and notify the Indexer (discussed
981below) of any action performed by the users, e.g. when users create, delete, or update files or
982folders. Watcher also listens to any changes happening on other clients that are broadcasted
983by Synchronization service.
984IV. Indexer will process the events received from the Watcher and update the internal
985metadata database with information about the chunks of the modified files. Once the chunks
986are successfully submitted/downloaded to the Cloud Storage, the Indexer will communicate
987with the remote Synchronization Service to broadcast changes to other clients and update
988remote metadata database.
990How should clients handle slow servers? Clients should exponentially back-off if the server
991is busy/not-responding. Meaning, if a server is too slow to respond, clients should delay their
992retries and this delay should increase exponentially.
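A small sketch of exponential backoff (with random jitter added, a common refinement) that a client could wrap around its calls to the server:

import random
import time

def call_with_backoff(operation, max_retries=6, base_delay=1.0, max_delay=60.0):
    """Retry `operation`, doubling the delay after each failure and adding random jitter."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay / 2))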
993Should mobile clients sync remote changes immediately? Unlike desktop or web clients,
994mobile clients usually sync on demand to save user’s bandwidth and space.
995b. Metadata Database
996The Metadata Database is responsible for maintaining the versioning and metadata
997information about files/chunks, users, and workspaces. The Metadata Database can be a
998relational database such as MySQL, or a NoSQL database service such as DynamoDB.
999Regardless of the type of the database, the Synchronization Service should be able to provide a
1000consistent view of the files using a database, especially if more than one user is working with
1001the same file simultaneously. Since NoSQL data stores do not support ACID properties in
1002favor of scalability and performance, we need to incorporate the support for ACID properties
1003programmatically in the logic of our Synchronization Service in case we opt for this kind of
1004database. However, using a relational database can simplify the implementation of the
1005Synchronization Service as they natively support ACID properties.
1006The metadata Database should be storing information about following objects:
10071. Chunks
10082. Files
10093. User
10104. Devices
10115. Workspace (sync folders)
1013c. Synchronization Service
1014The Synchronization Service is the component that processes file updates made by a client
1015and applies these changes to other subscribed clients. It also synchronizes clients’ local
1016databases with the information stored in the remote Metadata DB. The Synchronization
1017Service is the most important part of the system architecture due to its critical role in
1018managing the metadata and synchronizing users’ files. Desktop clients communicate with the
1019Synchronization Service to either obtain updates from the Cloud Storage or send files and
updates to the Cloud Storage and, potentially, other users. If a client was offline for a period, it polls the system for new updates as soon as it comes online. When the Synchronization
1022Service receives an update request, it checks with the Metadata Database for consistency and
1023then proceeds with the update. Subsequently, a notification is sent to all subscribed users or
1024devices to report the file update.
1025The Synchronization Service should be designed in such a way that it transmits less data
1026between clients and the Cloud Storage to achieve a better response time. To meet this design
1027goal, the Synchronization Service can employ a differencing algorithm to reduce the amount
1028of the data that needs to be synchronized. Instead of transmitting entire files from clients to
1029the server or vice versa, we can just transmit the difference between two versions of a file.
1030Therefore, only the part of the file that has been changed is transmitted. This also decreases
1031bandwidth consumption and cloud data storage for the end user. As described above, we will
1032be dividing our files into 4MB chunks and will be transferring modified chunks only. Server
1033and clients can calculate a hash (e.g., SHA-256) to see whether to update the local copy of a
1034chunk or not. On the server, if we already have a chunk with a similar hash (even from
1035another user), we don’t need to create another copy, we can use the same chunk. This is
1036discussed in detail later under Data Deduplication.
1037To be able to provide an efficient and scalable synchronization protocol we can consider using
1038a communication middleware between clients and the Synchronization Service. The
1039messaging middleware should provide scalable message queuing and change notifications to
1040support a high number of clients using pull or push strategies. This way, multiple
1041Synchronization Service instances can receive requests from a global request Queue, and the
1042communication middleware will be able to balance its load.
1043d. Message Queuing Service
1044An important part of our architecture is a messaging middleware that should be able to
1045handle a substantial number of requests. A scalable Message Queuing Service that supports
1046asynchronous message-based communication between clients and the Synchronization
1047Service best fits the requirements of our application. The Message Queuing Service supports
1048asynchronous and loosely coupled message-based communication between distributed
1049components of the system. The Message Queuing Service should be able to efficiently store
1050any number of messages in a highly available, reliable and scalable queue.
The Message Queuing Service will implement two types of queues in our system. The Request Queue is a global queue that all clients will share. Clients' requests to update the Metadata Database will be sent to the Request Queue first; from there, the Synchronization Service will pick them up to update the metadata. The Response Queues, which correspond to individual subscribed clients, are responsible for delivering the update messages to each client. Since a message will be deleted from the queue once received by a client, we need to create a separate Response Queue for each subscribed client to be able to share update messages with all of them.
1059e. Cloud/Block Storage
1060Cloud/Block Storage stores chunks of files uploaded by the users. Clients directly interact with
1061the storage to send and receive objects from it. Separation of the metadata from storage
1062enables us to use any storage either in the cloud or in-house.
1064Detailed component design for Dropbox
10657. File Processing Workflow
1066The sequence below shows the interaction between the components of the application in a
1067scenario when Client A updates a file that is shared with Client B and C, so they should receive
1068the update too. If the other clients are not online at the time of the update, the Message
1069Queuing Service keeps the update notifications in separate response queues for them until
1070they come online later.
10711. Client A uploads chunks to cloud storage.
10722. Client A updates metadata and commits changes.
10733. Client A gets confirmation and notifications are sent to Clients B and C about the
1074changes.
10754. Client B and C receive metadata changes and download updated chunks.
10768. Data Deduplication
1077Data deduplication is a technique used for eliminating duplicate copies of data to improve
1078storage utilization. It can also be applied to network data transfers to reduce the number of
1079bytes that must be sent. For each new incoming chunk, we can calculate a hash of it and
1080compare that hash with all the hashes of the existing chunks to see if we already have the
1081same chunk present in our storage.
1082We can implement deduplication in two ways in our system:
1083a. Post-process deduplication
1084With post-process deduplication, new chunks are first stored on the storage device and later
1086some process analyzes the data looking for duplication. The benefit is that clients will not
1087need to wait for the hash calculation or lookup to complete before storing the data, thereby
1088ensuring that there is no degradation in storage performance. Drawbacks of this approach are
10891) We will unnecessarily be storing duplicate data, though for a short time, 2) Duplicate data
1090will be transferred consuming bandwidth.
1091b. In-line deduplication
1092Alternatively, deduplication hash calculations can be done in real-time as the clients are
1093entering data on their device. If our system identifies a chunk that it has already stored, only a
1094reference to the existing chunk will be added in the metadata, rather than a full copy of the
1095chunk. This approach will give us optimal network and storage usage.
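A minimal sketch of in-line deduplication, using in-memory dictionaries as stand-ins for the chunk-hash index (which would really live in the Metadata DB) and for Block storage:

import hashlib

chunk_index = {}   # sha256 hex digest -> storage location (stand-in for the Metadata DB)
block_store = {}   # storage location -> chunk bytes (stand-in for Block/Cloud storage)

def store_chunk(data: bytes) -> str:
    """Return a reference to the chunk, writing it only if it has not been seen before."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in chunk_index:
        return chunk_index[digest]       # duplicate: reuse the existing chunk
    location = "chunk/" + digest         # content-addressed location (an assumption)
    block_store[location] = data
    chunk_index[digest] = location
    return location

ref1 = store_chunk(b"hello world")
ref2 = store_chunk(b"hello world")       # second upload is deduplicated
assert ref1 == ref2 and len(block_store) == 1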
10969. Metadata Partitioning
1097To scale out metadata DB, we need to partition it so that it can store information about
1098millions of users and billions of files/chunks. We need to come up with a partitioning scheme
1099that would divide and store our data in different DB servers.
11001. Vertical Partitioning: We can partition our database in such a way that we store tables
1101related to one particular feature on one server. For example, we can store all the user related
tables in one database and all files/chunks related tables in another database. Although this approach is straightforward to implement, it has some issues:
11041. Will we still have scale issues? What if we have trillions of chunks to be stored and our
1105database cannot support storing such a huge number of records? How would we further
1106partition such tables?
11072. Joining two tables in two separate databases can cause performance and consistency
1108issues. How frequently do we have to join user and file tables?
11092. Range Based Partitioning: What if we store files/chunks in separate partitions based on
1110the first letter of the File Path? In that case, we save all the files starting with the letter ‘A’ in
1111one partition and those that start with the letter ‘B’ into another partition and so on. This
1112approach is called range based partitioning. We can even combine certain less frequently
1113occurring letters into one database partition. We should come up with this partitioning
1114scheme statically so that we can always store/find a file in a predictable manner.
The main problem with this approach is that it can lead to unbalanced servers. For example, if we decide to put all files starting with the letter 'E' into one DB partition, we may later realize that we have so many files starting with the letter 'E' that we cannot fit them into a single DB partition.
3. Hash-Based Partitioning: In this scheme we take a hash of the object we are storing and, based on this hash, we figure out the DB partition to which this object should go. In our case, we can take the hash of the 'FileID' of the File object we are storing to determine the partition where the file will be stored. Our hashing function will randomly distribute objects into different partitions, e.g., our hashing function can always map any ID to a number between [1…256], and this number would be the partition in which we will store our object.
1126This approach can still lead to overloaded partitions, which can be solved by using Consistent
1127Hashing.
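For example, a sketch of such a hash-based mapping, using MD5 purely as a stable hash and the 256 partitions assumed above:

import hashlib

NUM_PARTITIONS = 256

def partition_for(file_id: str) -> int:
    """Map a FileID to a partition number in [1..256]."""
    digest = hashlib.md5(file_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS + 1

print(partition_for("file-8f3a2c"))   # the same FileID always maps to the same partition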
112810. Caching
We can have two kinds of caches in our system. To deal with hot files/chunks, we can introduce a cache for Block storage. We can use an off-the-shelf solution like Memcached that can store whole chunks with their respective IDs/hashes; Block servers, before hitting Block storage, can quickly check if the cache has the desired chunk. Based on clients' usage patterns, we can determine how many cache servers we need. A high-end commercial server can have 144GB of memory; one such server can cache 36K 4MB chunks.
Which cache replacement policy would best fit our needs? When the cache is full and we want to replace a chunk with a newer/hotter chunk, how would we choose? Least Recently Used (LRU) can be a reasonable policy for our system. Under this policy, we discard the least recently used chunk first. Similarly, we can have a cache for the Metadata DB.
113911. Load Balancer (LB)
1140We can add the Load balancing layer at two places in our system: 1) Between Clients and
1141Block servers and 2) Between Clients and Metadata servers. Initially, a simple Round Robin
1142approach can be adopted that distributes incoming requests equally among backend servers.
1143This LB is simple to implement and does not introduce any overhead. Another benefit of this
1144approach is if a server is dead, LB will take it out of the rotation and will stop sending any
traffic to it. A problem with Round Robin LB is that it won't take server load into consideration. If a server is overloaded or slow, the LB will not stop sending new requests to that server. To handle this, a more intelligent LB solution can be placed that periodically queries backend servers about their load and adjusts traffic based on that.
114912. Security, Permissions and File Sharing
One of the primary concerns users will have while storing their files in the cloud is the privacy and security of their data, especially since in our system users can share their files with other users or even make them public to share them with everyone. To handle this, we will be storing the permissions of each file in our metadata DB to reflect which files are visible or modifiable by any user.
1156Designing Facebook Messenger
1157Let's design an instant messaging service like Facebook Messenger where users can send text
1158messages to each other through web and mobile interfaces.
11591. What is Facebook Messenger?
1160Facebook Messenger is a software application which provides text-based instant messaging
1161services to its users. Messenger users can chat with their Facebook friends both from cellphones and Facebook’s website.
11622. Requirements and Goals of the System
1163Our Messenger should meet the following requirements:
1164Functional Requirements:
11651. Messenger should support one-on-one conversations between users.
11662. Messenger should keep track of the online/offline statuses of its users.
11673. Messenger should support the persistent storage of chat history.
1168Non-functional Requirements:
11691. Users should have real-time chat experience with minimum latency.
11702. Our system should be highly consistent; users should be able to see the same chat
1171history on all their devices.
11723. Messenger’s high availability is desirable; we can tolerate lower availability in the
1173interest of consistency.
1174Extended Requirements:
1175• Group Chats: Messenger should support multiple people talking to each other in a
1176group.
1177• Push notifications: Messenger should be able to notify users of new messages when they
1178are offline.
11793. Capacity Estimation and Constraints
1180Let’s assume that we have 500 million daily active users and on average each user sends 40
1181messages daily; this gives us 20 billion messages per day.
1182Storage Estimation: Let’s assume that on average a message is 100 bytes, so to store all the
1183messages for one day we would need 2TB of storage.
118420 billion messages * 100 bytes => 2 TB/day
1186To store five years of chat history, we would need 3.6 petabytes of storage.
11872 TB * 365 days * 5 years ~= 3.6 PB
1188Other than the chat messages, we would also need to store users’ information, messages’
1189metadata (ID, Timestamp, etc.). Not to mention, the above calculation doesn’t take data
1190compression and replication into consideration.
1191Bandwidth Estimation: If our service is getting 2TB of data every day, this will give us 25MB
1192of incoming data for each second.
11932 TB / 86400 sec ~= 25 MB/s
1194Since each incoming message needs to go out to another user, we will need the same amount
1195of bandwidth 25MB/s for both upload and download.
1196High level estimates:
1197Total messages 20 billion per day
1198Storage for each day 2TB
1199Storage for 5 years 3.6PB
Incoming data 25MB/s
1201Outgoing data 25MB/s
12024. High Level Design
1203At a high-level, we will need a chat server that will be the central piece, orchestrating all the
1204communications between users. When a user wants to send a message to another user, they
1205will connect to the chat server and send the message to the server; the server then passes that
1206message to the other user and also stores it in the database.
1208The detailed workflow would look like this:
12091. User-A sends a message to User-B through the chat server.
12102. The server receives the message and sends an acknowledgment to User-A.
12113. The server stores the message in its database and sends the message to User-B.
12124. User-B receives the message and sends the acknowledgment to the server.
12135. The server notifies User-A that the message has been delivered successfully to User-B.
1214Request flow for sending a message
12165. Detailed Component Design
1217Let’s try to build a simple solution first where everything runs on one server. At the high level
1218our system needs to handle the following use cases:
12191. Receive incoming messages and deliver outgoing messages.
12202. Store and retrieve messages from the database.
12213. Keep a record of which user is online or has gone offline, and notify all the relevant
1222users about these status changes.
1223Let’s talk about these scenarios one by one:
1224a. Messages Handling
1225How would we efficiently send/receive messages? To send messages, a user needs to
1226connect to the server and post messages for the other users. To get a message from the server,
1227the user has two options:
12281. Pull model: Users can periodically ask the server if there are any new messages for
1229them.
12302. Push model: Users can keep a connection open with the server and can depend upon
1231the server to notify them whenever there are new messages.
1232If we go with our first approach, then the server needs to keep track of messages that are still
1233waiting to be delivered, and as soon as the receiving user connects to the server to ask for any
new message, the server can return all the pending messages. To minimize latency for the user, they have to check the server quite frequently, and most of the time they will get an empty response if there are no pending messages. This will waste a lot of resources and does not look like an efficient solution.
1238If we go with our second approach, where all the active users keep a connection open with the
1239server, then as soon as the server receives a message it can immediately pass the message to
1240the intended user. This way, the server does not need to keep track of the pending messages,
1241and we will have minimum latency, as the messages are delivered instantly on the opened
1242connection.
1244How will clients maintain an open connection with the server? We can use HTTP Long
1245Polling or WebSockets. In long polling, clients can request information from the server with
1246the expectation that the server may not respond immediately. If the server has no new data for
1247the client when the poll is received, instead of sending an empty response, the server holds the
1248request open and waits for response information to become available. Once it does have new
1249information, the server immediately sends the response to the client, completing the open
1250request. Upon receipt of the server response, the client can immediately issue another server
1251request for future updates. This gives a lot of improvements in latencies, throughputs, and
1252performance. The long polling request can timeout or can receive a disconnect from the
1253server, in that case, the client has to open a new request.
How can the server keep track of all the open connections to redirect messages to the users efficiently? The server can maintain a hash table, where the "key" would be the UserID and the "value" would be the connection object. So whenever the server receives a message for a user, it looks up that user in the hash table to find the connection object and sends the message on the open request.
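A minimal sketch of this lookup table on a single chat server; the connection object and its send() method are placeholders for whatever WebSocket or long-poll handle the server actually holds.

connections = {}   # UserID -> connection object (e.g., an open WebSocket/long-poll handle)

def register(user_id: int, connection) -> None:
    connections[user_id] = connection

def unregister(user_id: int) -> None:
    connections.pop(user_id, None)

def deliver(recipient_id: int, message: dict) -> bool:
    """Push the message on the recipient's open connection, if this server holds one."""
    conn = connections.get(recipient_id)
    if conn is None:
        return False          # recipient is not connected here; notify the sender or retry later
    conn.send(message)        # assumed connection API, for illustration
    return True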
1259What will happen when the server receives a message for a user who has gone offline? If
1260the receiver has disconnected, the server can notify the sender about the delivery failure. If it
1261is a temporary disconnect, e.g., the receiver’s long-poll request just timed out, then we should
1262expect a reconnect from the user. In that case, we can ask the sender to retry sending the
1263message. This retry could be embedded in the client’s logic so that users don’t have to retype
1264the message. The server can also store the message for a while and retry sending it once the
1265receiver reconnects.
How many chat servers do we need? Let's plan for 500 million connections at any time. Assuming a modern server can handle 50K concurrent connections at any time, we would need 10K such servers.
1269How do we know which server holds the connection to which user? We can introduce a
1270software load balancer in front of our chat servers; that can map each UserID to a server to
1271redirect the request.
1272How should the server process a ‘deliver message’ request? The server needs to do the
1273following things upon receiving a new message: 1) Store the message in the database 2) Send
1274the message to the receiver and 3) Send an acknowledgment to the sender.
1275The chat server will first find the server that holds the connection for the receiver and pass the
1276message to that server to send it to the receiver. The chat server can then send the
1277acknowledgment to the sender; we don’t need to wait for storing the message in the database
1278(this can happen in the background). Storing the message is discussed in the next section.
1279How does the messenger maintain the sequencing of the messages? We can store a
1280timestamp with each message, which is the time the message is received by the server. This
1281will still not ensure correct ordering of messages for clients. The scenario where the server
1282timestamp cannot determine the exact order of messages would look like this:
12831. User-1 sends a message M1 to the server for User-2.
12852. The server receives M1 at T1.
12863. Meanwhile, User-2 sends a message M2 to the server for User-1.
12874. The server receives the message M2 at T2, such that T2 > T1.
12885. The server sends message M1 to User-2 and M2 to User-1.
1289So User-1 will see M1 first and then M2, whereas User-2 will see M2 first and then M1.
1290To resolve this, we need to keep a sequence number with every message for each client. This
1291sequence number will determine the exact ordering of messages for EACH user. With this
1292solution, both clients will see a different view of the message sequence, but this view will be
1293consistent for them on all devices.
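A small sketch of per-user sequence numbering on the server side: every message delivered to a given user gets the next number in that user's own sequence, and the client orders its conversation view by this number.

from collections import defaultdict
from itertools import count

# One independent, monotonically increasing counter per recipient.
_sequences = defaultdict(lambda: count(1))

def stamp_for(recipient_id: int, message: dict) -> dict:
    """Attach the recipient-specific sequence number before delivering the message."""
    message["seq"] = next(_sequences[recipient_id])
    return message

m1 = stamp_for(2, {"from": 1, "text": "M1"})   # User-2 sees M1 with seq 1 in their sequence
m2 = stamp_for(1, {"from": 2, "text": "M2"})   # User-1 sees M2 with seq 1 in their own sequence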
1294b. Storing and retrieving the messages from the database
1295Whenever the chat server receives a new message, it needs to store it in the database. To do
1296so, we have two options:
12971. Start a separate thread, which will work with the database to store the message.
12982. Send an asynchronous request to the database to store the message.
1299We have to keep certain things in mind while designing our database:
13001. How to efficiently work with the database connection pool.
13012. How to retry failed requests.
13023. Where to log those requests that failed even after some retries.
13034. How to retry these logged requests (that failed after the retry) when all the issues have
1304resolved.
1305Which storage system we should use? We need to have a database that can support a very
1306high rate of small updates and also fetch a range of records quickly. This is required because
1307we have a huge number of small messages that need to be inserted in the database and, while
1308querying, a user is mostly interested in sequentially accessing the messages.
1309We cannot use RDBMS like MySQL or NoSQL like MongoDB because we cannot afford to
1310read/write a row from the database every time a user receives/sends a message. This will not
1311only make the basic operations of our service run with high latency but also create a huge load
1312on databases.
1313Both of our requirements can be easily met with a wide-column database solution like HBase.
1314HBase is a column-oriented key-value NoSQL database that can store multiple values against
1315one key into multiple columns. HBase is modeled after Google’s BigTable and runs on top of
Hadoop Distributed File System (HDFS). HBase groups data together: it stores new data in a memory buffer and, once the buffer is full, dumps the data to disk. This way of storing data not only helps store a lot of small messages quickly, but also makes it efficient to fetch rows by key or scan ranges of rows. HBase is also an efficient database for storing variable-sized data, which our service also requires.
How should clients efficiently fetch data from the server? Clients should paginate while fetching data from the server. Page size could be different for different clients, e.g., cell phones have smaller screens, so we need fewer messages/conversations in the viewport.
1325c. Managing user’s status
1326We need to keep track of user’s online/offline status and notify all the relevant users whenever
1327a status change happens. Since we are maintaining a connection object on the server for all
1328active users, we can easily figure out the user’s current status from this. With 500M active
1329users at any time, if we have to broadcast each status change to all the relevant active users, it
1330will consume a lot of resources. We can do the following optimization around this:
13311. Whenever a client starts the app, it can pull the current status of all users in their
1332friends’ list.
13332. Whenever a user sends a message to another user that has gone offline, we can send a
1334failure to the sender and update the status on the client.
3. Whenever a user comes online, the server can broadcast that status with a delay of a few seconds, to make sure the user does not go offline immediately.
13374. Clients can pull the status from the server about those users that are being shown on
1338the user’s viewport. This should not be a frequent operation, as the server is
1339broadcasting the online status of users and we can live with the stale offline status of
1340users for a while.
13415. Whenever the client starts a new chat with another user, we can pull the status at that
1342time.
1343Detailed component design for Facebook messenger
1345Design Summary: Clients will open a connection to the chat server to send a message; the
1346server will then pass it to the requested user. All the active users will keep a connection open
1347with the server to receive messages. Whenever a new message arrives, the chat server will
1348push it to the receiving user on the long poll request. Messages can be stored in HBase, which
1349supports quick small updates, and range based searches. The servers can broadcast the online
1350status of a user to other relevant users. Clients can pull status updates for users who are
1351visible in the client’s viewport on a less frequent basis.
13526. Data partitioning
1353Since we will be storing a lot of data (3.6PB for five years), we need to distribute it onto
1354multiple database servers. What will be our partitioning scheme?
1355Partitioning based on UserID: Let’s assume we partition based on the hash of the UserID so
1356that we can keep all messages of a user on the same database. If one DB shard is 4TB, we will
1357have “3.6PB/4TB ~= 900” shards for five years. For simplicity, let’s assume we keep 1K
1358shards. So we will find the shard number by “hash(UserID) % 1000” and then store/retrieve
the data from there. This partitioning scheme will also make it very quick to fetch the chat history of any user.
1361In the beginning, we can start with fewer database servers with multiple shards residing on
1362one physical server. Since we can have multiple database instances on a server, we can easily
1363store multiple partitions on a single server. Our hash function needs to understand this logical
1364partitioning scheme so that it can map multiple logical partitions on one physical server.
1365Since we will store an unlimited history of messages, we can start with a big number of logical
1366partitions, which will be mapped to fewer physical servers, and as our storage demand
1367increases, we can add more physical servers to distribute our logical partitions.
1368Partitioning based on MessageID: If we store different messages of a user on separate
1369database shards, fetching a range of messages of a chat would be very slow, so we should not
1370adopt this scheme.
13717. Cache
1372We can cache a few recent messages (say last 15) in a few recent conversations that are visible
1373in a user’s viewport (say last 5). Since we decided to store all of the user’s messages on one
1374shard, the cache for a user should entirely reside on one machine too.
13758. Load balancing
1376We will need a load balancer in front of our chat servers; that can map each UserID to a server
1377that holds the connection for the user and then direct the request to that server. Similarly, we
1378would need a load balancer for our cache servers.
13809. Fault tolerance and Replication
1381What will happen when a chat server fails? Our chat servers are holding connections with
1382the users. If a server goes down, should we devise a mechanism to transfer those connections
1383to some other server? It’s extremely hard to failover TCP connections to other servers; an
1384easier approach can be to have clients automatically reconnect if the connection is lost.
1385Should we store multiple copies of user messages? We cannot have only one copy of the
1386user’s data, because if the server holding the data crashes or is down permanently, we don’t
1387have any mechanism to recover that data. For this, either we have to store multiple copies of
1388the data on different servers or use techniques like Reed-Solomon encoding to distribute and
1389replicate it.
139010. Extended Requirements
1391a. Group chat
1392We can have separate group-chat objects in our system that can be stored on the chat servers.
1393A group-chat object is identified by GroupChatID and will also maintain a list of people who
1394are part of that chat. Our load balancer can direct each group chat message based on
1395GroupChatID and the server handling that group chat can iterate through all the users of the
1396chat to find the server handling the connection of each user to deliver the message.
1397In databases, we can store all the group chats in a separate table partitioned based on
1398GroupChatID.
1399b. Push notifications
1400In our current design, users can only send messages to active users and if the receiving user is
1401offline, we send a failure to the sending user. Push notifications will enable our system to send
1402messages to offline users.
1403For Push notifications, each user can opt-in from their device (or a web browser) to get
1404notifications whenever there is a new message or event. Each manufacturer maintains a set of
1405servers that handles pushing these notifications to the user.
To have push notifications in our system, we would need to set up a Notification server, which will take the messages for offline users and send them to the manufacturer's push notification server, which will then send them to the user's device.
1410Designing Twitter
1411Let's design a Twitter-like social networking service. Users of the service will be able to post
1412tweets, follow other people, and favorite tweets. Difficulty Level: Medium
14131. What is Twitter?
1414Twitter is an online social networking service where users post and read short 140-character
1415messages called "tweets." Registered users can post and read tweets, but those who are not
1416registered can only read them. Users access Twitter through their website interface, SMS, or
1417mobile app.
14182. Requirements and Goals of the System
1419We will be designing a simpler version of Twitter with the following requirements:
1420Functional Requirements
14211. Users should be able to post new tweets.
14222. A user should be able to follow other users.
14233. Users should be able to mark tweets as favorites.
14244. The service should be able to create and display a user’s timeline consisting of top
1425tweets from all the people the user follows.
14265. Tweets can contain photos and videos.
1427Non-functional Requirements
14281. Our service needs to be highly available.
14292. Acceptable latency of the system is 200ms for timeline generation.
14303. Consistency can take a hit (in the interest of availability); if a user doesn’t see a tweet
1431for a while, it should be fine.
1432Extended Requirements
14331. Searching for tweets.
14342. Replying to a tweet.
14353. Trending topics – current hot topics/searches.
14364. Tagging other users.
14375. Tweet Notification.
14386. Who to follow? Suggestions?
14397. Moments.
14413. Capacity Estimation and Constraints
1442Let’s assume we have one billion total users with 200 million daily active users (DAU). Also
1443assume we have 100 million new tweets every day and on average each user follows 200
1444people.
1445How many favorites per day? If, on average, each user favorites five tweets per day we will
1446have:
1447200M users * 5 favorites => 1B favorites
1448How many total tweet-views will our system generate? Let’s assume on average a user visits
1449their timeline two times a day and visits five other people’s pages. On each page if a user sees
145020 tweets, then our system will generate 28B/day total tweet-views:
1451200M DAU * ((2 + 5) * 20 tweets) => 28B/day
Storage Estimates: Let's say each tweet has 140 characters and we need two bytes to store a character without compression. Let's assume we need 30 bytes to store metadata with each tweet (like ID, timestamp, user ID, etc.). Total storage we would need:
1455100M * (280 + 30) bytes => 30GB/day
1456What would our storage needs be for five years? How much storage we would need for users’
1457data, follows, favorites? We will leave this for the exercise.
1458Not all tweets will have media, let’s assume that on average every fifth tweet has a photo and
1459every tenth has a video. Let’s also assume on average a photo is 200KB and a video is 2MB.
1460This will lead us to have 24TB of new media every day.
1461(100M/5 photos * 200KB) + (100M/10 videos * 2MB) ~= 24TB/day
Bandwidth Estimates: Since total ingress is 24TB per day, this would translate into 290MB/sec.
1464Remember that we have 28B tweet views per day. We must show the photo of every tweet (if it
1465has a photo), but let’s assume that the users watch every 3rd video they see in their timeline.
1466So, total egress will be:
(28B * 280 bytes) / 86400s of text => 93MB/s
+ (28B/5 * 200KB) / 86400s of photos => 13GB/s
+ (28B/10/3 * 2MB) / 86400s of videos => 22GB/s
Total ~= 35GB/s
14724. System APIs
Once we've finalized the requirements, it's always a good idea to define the system APIs. This should explicitly state what is expected from the system.
1475We can have SOAP or REST APIs to expose the functionality of our service. Following could
1476be the definition of the API for posting a new tweet:
tweet(api_dev_key, tweet_data, tweet_location, user_location, media_ids, maximum_results_to_return)
1479Parameters:
1480api_dev_key (string): The API developer key of a registered account. This will be used to,
1481among other things, throttle users based on their allocated quota.
1482tweet_data (string): The text of the tweet, typically up to 140 characters.
1483tweet_location (string): Optional location (longitude, latitude) this Tweet refers to.
1484user_location (string): Optional location (longitude, latitude) of the user adding the tweet.
media_ids (number[]): Optional list of media_ids to be associated with the Tweet. (All the media, i.e., photos, videos, etc., needs to be uploaded separately.)
1487Returns: (string)
1488A successful post will return the URL to access that tweet. Otherwise, an appropriate HTTP
1489error is returned.
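If this API were exposed over REST, a client call could look roughly like the sketch below; the endpoint URL and the JSON field layout are assumptions for illustration only.

import requests   # third-party HTTP client

def post_tweet(api_dev_key: str, text: str, media_ids=None) -> str:
    payload = {
        "api_dev_key": api_dev_key,
        "tweet_data": text,               # up to 140 characters
        "media_ids": media_ids or [],     # media must already have been uploaded separately
    }
    resp = requests.post("https://api.example.com/v1/tweet", json=payload, timeout=10)
    resp.raise_for_status()               # an HTTP error code signals failure
    return resp.text                      # URL of the newly created tweet

# url = post_tweet("my-dev-key", "Hello, world!")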
14905. High Level System Design
1491We need a system that can efficiently store all the new tweets, 100M/86400s => 1150 tweets
1492per second and read 28B/86400s => 325K tweets per second. It is clear from the
1493requirements that this will be a read-heavy system.
1494At a high level, we need multiple application servers to serve all these requests with load
1495balancers in front of them for traffic distributions. On the backend, we need an efficient
1496database that can store all the new tweets and can support a huge number of reads. We also
1497need some file storage to store photos and videos.
Our expected daily write load is 100 million tweets and our read load is 28 billion tweet views. This means that, on average, our system will receive around 1160 new tweets and 325K read requests per second. This traffic will be distributed unevenly throughout the day, though; at peak time we should expect at least a few thousand write requests and around 1M read requests per second. We should keep this in mind while designing the architecture of our system.
15046. Database Schema
1505We need to store data about users, their tweets, their favorite tweets, and people they follow.
1506For choosing between SQL and NoSQL databases to store the above schema, please see
1507‘Database schema’ under Designing Instagram.
15087. Data Sharding
1509Since we have a huge number of new tweets every day and our read load is extremely high too,
1510we need to distribute our data onto multiple machines such that we can read/write it
1511efficiently. We have many options to shard our data; let’s go through them one by one:
1512Sharding based on UserID: We can try storing all the data of a user on one server. While
1513storing, we can pass the UserID to our hash function that will map the user to a database
server where we will store all of the user's tweets, favorites, follows, etc. While querying for tweets/follows/favorites of a user, we can ask our hash function where the user's data is stored and then read it from there. This approach has a couple of issues:
15171. What if a user becomes hot? There could be a lot of queries on the server holding the
1518user. This high load will affect the performance of our service.
15192. Over time some users can end up storing a lot of tweets or having a lot of follows
1520compared to others. Maintaining a uniform distribution of growing user data is quite
1521difficult.
1522To recover from these situations either we have to repartition/redistribute our data or use
1523consistent hashing.
1524Sharding based on TweetID: Our hash function will map each TweetID to a random server
1525where we will store that Tweet. To search for tweets, we have to query all servers, and each
server will return a set of tweets. A centralized server will aggregate these results to return them to the user. Let's look into a timeline generation example; here are the steps our system has to perform to generate a user's timeline:
15291. Our application (app) server will find all the people the user follows.
15302. App server will send the query to all database servers to find tweets from these people.
15313. Each database server will find the tweets for each user, sort them by recency and return
1532the top tweets.
15344. App server will merge all the results and sort them again to return the top results to the
1535user.
1536This approach solves the problem of hot users, but, in contrast to sharding by UserID, we have
1537to query all database partitions to find tweets of a user, which can result in higher latencies.
1538We can further improve our performance by introducing cache to store hot tweets in front of
1539the database servers.
1540Sharding based on Tweet creation time: Storing tweets based on creation time will give us
1541the advantage of fetching all the top tweets quickly and we only have to query a very small set
1542of servers. The problem here is that the traffic load will not be distributed, e.g., while writing,
1543all new tweets will be going to one server and the remaining servers will be sitting idle.
1544Similarly, while reading, the server holding the latest data will have a very high load as
1545compared to servers holding old data.
1546What if we can combine sharding by TweetID and Tweet creation time? If we don’t store
1547tweet creation time separately and use TweetID to reflect that, we can get benefits of both the
1548approaches. This way it will be quite quick to find the latest Tweets. For this, we must make
1549each TweetID universally unique in our system and each TweetID should contain a timestamp
1550too.
1551We can use epoch time for this. Let’s say our TweetID will have two parts: the first part will be
1552representing epoch seconds and the second part will be an auto-incrementing sequence. So, to
1553make a new TweetID, we can take the current epoch time and append an auto-incrementing
1554number to it. We can figure out the shard number from this TweetID and store it there.
What could be the size of our TweetID? Let's say our epoch time starts today; how many bits would we need to store the number of seconds for the next 50 years?
155786400 sec/day * 365 (days a year) * 50 (years) => 1.6B
We would need 31 bits to store this number. Since on average we are expecting 1150 new tweets per second, we can allocate 17 bits to store the auto-incremented sequence; this will make our TweetID 48 bits long. So, every second we can store 2^17 (=> 130K) new tweets. We can reset our auto-incrementing sequence every second. For fault tolerance and better performance, we can have two database servers to generate auto-incrementing keys for us, one generating even-numbered keys and the other generating odd-numbered keys.
1564If we assume our current epoch seconds are “1483228800,” our TweetID will look like this:
15661483228800 000001
15671483228800 000002
15681483228800 000003
15691483228800 000004
1570…
If we make our TweetID 64 bits (8 bytes) long, we can easily store tweets for the next 100 years and also store them with millisecond granularity.
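The examples above simply show the two parts side by side; one concrete way to pack them into a single 64-bit integer is a bit shift, as sketched below (the custom epoch and shard count are assumptions consistent with the discussion above). Because the epoch seconds occupy the high bits, sorting by TweetID also sorts tweets by creation time.

SEQUENCE_BITS = 17            # 2^17 => ~130K tweets per second
CUSTOM_EPOCH = 1483228800     # the epoch start used in the example IDs above
NUM_SHARDS = 200              # assumed number of shards, for illustration

def make_tweet_id(epoch_seconds: int, sequence: int) -> int:
    return ((epoch_seconds - CUSTOM_EPOCH) << SEQUENCE_BITS) | sequence

def creation_time(tweet_id: int) -> int:
    return (tweet_id >> SEQUENCE_BITS) + CUSTOM_EPOCH

def shard_for(tweet_id: int) -> int:
    return tweet_id % NUM_SHARDS

tid = make_tweet_id(1483228800, 1)
assert creation_time(tid) == 1483228800
print(tid, "-> shard", shard_for(tid))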
1573In the above approach, we still have to query all the servers for timeline generation, but our
1574reads (and writes) will be substantially quicker.
15751. Since we don’t have any secondary index (on creation time) this will reduce our write
1576latency.
15772. While reading, we don’t need to filter on creation-time as our primary key has epoch
1578time included in it.
15798. Cache
1580We can introduce a cache for database servers to cache hot tweets and users. We can use an
off-the-shelf solution like Memcache that can store whole tweet objects. Application servers, before hitting the database, can quickly check if the cache has the desired tweets. Based on
1583clients’ usage patterns we can determine how many cache servers we need.
1584Which cache replacement policy would best fit our needs? When the cache is full and we
1585want to replace a tweet with a newer/hotter tweet, how would we choose? Least Recently Used
1586(LRU) can be a reasonable policy for our system. Under this policy, we discard the least
1587recently viewed tweet first.
How can we have a more intelligent cache? If we go with the 80-20 rule, that is, 20% of tweets generating 80% of the read traffic, certain tweets are so popular that a majority of people read them. This suggests that we can try to cache 20% of the daily read volume from each shard.
What if we cache the latest data? Our service can benefit from this approach. Let's say 80% of our users see tweets from the past three days only; we can then try to cache all the tweets from
1594the past three days. Let’s say we have dedicated cache servers that cache all the tweets from all
1595the users from the past three days. As estimated above, we are getting 100 million new tweets
or 30GB of new data every day (without photos and videos). If we want to store all the tweets from the last three days, we will need less than 100GB of memory. This data can easily fit into one
1598server, but we should replicate it onto multiple servers to distribute all the read traffic to
1599reduce the load on cache servers. So whenever we are generating a user’s timeline, we can ask
1600the cache servers if they have all the recent tweets for that user. If yes, we can simply return all
1601the data from the cache. If we don’t have enough tweets in the cache, we have to query the
1602backend server to fetch that data. On a similar design, we can try caching photos and videos
1603from the last three days.
1605Our cache would be like a hash table where ‘key’ would be ‘OwnerID’ and ‘value’ would be a
1606doubly linked list containing all the tweets from that user in the past three days. Since we
1607want to retrieve the most recent data first, we can always insert new tweets at the head of the
1608linked list, which means all the older tweets will be near the tail of the linked list. Therefore,
1609we can remove tweets from the tail to make space for newer tweets.
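A minimal sketch of this structure using Python's deque (a doubly linked list) per owner; the per-owner size cap is an assumed simplification of the three-day retention rule.

from collections import deque

class RecentTweetsCache:
    """OwnerID -> doubly linked list of that owner's recent tweets, newest at the head."""

    def __init__(self, max_tweets_per_owner=1000):
        self.store = {}
        self.max_tweets_per_owner = max_tweets_per_owner

    def add_tweet(self, owner_id, tweet):
        tweets = self.store.setdefault(owner_id, deque())
        tweets.appendleft(tweet)                      # newest tweets go to the head
        if len(tweets) > self.max_tweets_per_owner:
            tweets.pop()                              # evict the oldest from the tail

    def recent_tweets(self, owner_id):
        return list(self.store.get(owner_id, ()))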
16109. Timeline Generation
1611For a detailed discussion about timeline generation, take a look at Designing Facebook’s
1612Newsfeed.
161310. Replication and Fault Tolerance
1614Since our system is read-heavy, we can have multiple secondary database servers for each DB
1615partition. Secondary servers will be used for read traffic only. All writes will first go to the
1616primary server and then will be replicated to secondary servers. This scheme will also give us
1617fault tolerance, since whenever the primary server goes down we can failover to a secondary
1618server.
161911. Load Balancing
We can add a load balancing layer at three places in our system: 1) between clients and application servers, 2) between application servers and database replication servers, and 3) between aggregation servers and cache servers. Initially, a simple Round Robin approach can be adopted that distributes incoming requests equally among servers. This LB is simple to
1624implement and does not introduce any overhead. Another benefit of this approach is that if a
1625server is dead, LB will take it out of the rotation and will stop sending any traffic to it. A
problem with Round Robin LB is that it won't take a server's load into consideration. If a server is overloaded or slow, the LB will not stop sending new requests to that server. To handle this, a more intelligent LB solution can be used that periodically queries backend servers about their load and adjusts traffic based on that.
163112. Monitoring
1632Having the ability to monitor our systems is crucial. We should constantly collect data to get
an instant insight into how our system is doing. We can collect the following metrics/counters to
1634get an understanding of the performance of our service:
16351. New tweets per day/second, what is the daily peak?
16362. Timeline delivery stats, how many tweets per day/second our service is delivering.
16373. Average latency that is seen by the user to refresh timeline.
1638By monitoring these counters, we will realize if we need more replication, load balancing, or
1639caching.
164013. Extended Requirements
1641How do we serve feeds? Get all the latest tweets from the people someone follows and
1642merge/sort them by time. Use pagination to fetch/show tweets. Only fetch top N tweets from
1643all the people someone follows. This N will depend on the client’s Viewport, since on a mobile
1644we show fewer tweets compared to a Web client. We can also cache next top tweets to speed
1645things up.
Alternatively, we can pre-generate the feed to improve efficiency; for details please see 'Ranking
1647and timeline generation’ under Designing Instagram.
1648Retweet: With each Tweet object in the database, we can store the ID of the original Tweet
1649and not store any contents on this retweet object.
1650Trending Topics: We can cache most frequently occurring hashtags or search queries in the
1651last N seconds and keep updating them after every M seconds. We can rank trending topics
1652based on the frequency of tweets or search queries or retweets or likes. We can give more
1653weight to topics which are shown to more people.
1654Who to follow? How to give suggestions? This feature will improve user engagement. We can
1655suggest friends of people someone follows. We can go two or three levels down to find famous
1656people for the suggestions. We can give preference to people with more followers.
1657As only a few suggestions can be made at any time, use Machine Learning (ML) to shuffle and
1658re-prioritize. ML signals could include people with recently increased follow-ship, common
1659followers if the other person is following this user, common location or interests, etc.
1661Moments: Get top news for different websites for past 1 or 2 hours, figure out related tweets,
1662prioritize them, categorize them (news, support, financial, entertainment, etc.) using ML –
1663supervised learning or Clustering. Then we can show these articles as trending topics in
1664Moments.
1665Search: Search involves Indexing, Ranking, and Retrieval of tweets. A similar solution is
1666discussed in our next problem Design Twitter Search.
1668Designing Youtube or Netflix
1669Let's design a video sharing service like Youtube, where users will be able to
1670upload/view/search videos. Similar Services: netflix.com, vimeo.com, dailymotion.com,
1671veoh.com Difficulty Level: Medium
16721. Why Youtube?
1673Youtube is one of the most popular video sharing websites in the world. Users of the service
1674can upload, view, share, rate, and report videos as well as add comments on videos.
16752. Requirements and Goals of the System
1676For the sake of this exercise, we plan to design a simpler version of Youtube with following
1677requirements:
1678Functional Requirements:
16791. Users should be able to upload videos.
16802. Users should be able to share and view videos.
16813. Users should be able to perform searches based on video titles.
16824. Our services should be able to record stats of videos, e.g., likes/dislikes, total number of
1683views, etc.
16845. Users should be able to add and view comments on videos.
1685Non-Functional Requirements:
16861. The system should be highly reliable, any video uploaded should not be lost.
16872. The system should be highly available. Consistency can take a hit (in the interest of
1688availability); if a user doesn’t see a video for a while, it should be fine.
16893. Users should have a real time experience while watching videos and should not feel any
1690lag.
1691Not in scope: Video recommendations, most popular videos, channels, subscriptions, watch
1692later, favorites, etc.
16933. Capacity Estimation and Constraints
1694Let’s assume we have 1.5 billion total users, 800 million of whom are daily active users. If, on
1695average, a user views five videos per day then the total video-views per second would be:
1696800M * 5 / 86400 sec => 46K videos/sec
1697Let’s assume our upload:view ratio is 1:200, i.e., for every video upload we have 200 videos
1698viewed, giving us 230 videos uploaded per second.
170046K / 200 => 230 videos/sec
1701Storage Estimates: Let’s assume that every minute 500 hours worth of videos are uploaded to
1702Youtube. If on average, one minute of video needs 50MB of storage (videos need to be stored
1703in multiple formats), the total storage needed for videos uploaded in a minute would be:
1704500 hours * 60 min * 50MB => 1500 GB/min (25 GB/sec)
These numbers are estimated ignoring video compression and replication, which would change our estimates.
1707Bandwidth estimates: With 500 hours of video uploads per minute and assuming each video
1708upload takes a bandwidth of 10MB/min, we would be getting 300GB of uploads every minute.
1709500 hours * 60 mins * 10MB => 300GB/min (5GB/sec)
1710Assuming an upload:view ratio of 1:200, we would need 1TB/s outgoing bandwidth.
17114. System APIs
1712We can have SOAP or REST APIs to expose the functionality of our service. The following
1713could be the definitions of the APIs for uploading and searching videos:
uploadVideo(api_dev_key, video_title, video_description, tags[], category_id, default_language, recording_details, video_contents)
1716Parameters:
1717api_dev_key (string): The API developer key of a registered account. This will be used to,
1718among other things, throttle users based on their allocated quota.
1719video_title (string): Title of the video.
video_description (string): Optional description of the video.
1721tags (string[]): Optional tags for the video.
1722category_id (string): Category of the video, e.g., Film, Song, People, etc.
1723default_language (string): For example English, Mandarin, Hindi, etc.
1724recording_details (string): Location where the video was recorded.
1725video_contents (stream): Video to be uploaded.
1726Returns: (string)
1727A successful upload will return HTTP 202 (request accepted) and once the video encoding is
1728completed the user is notified through email with a link to access the video. We can also
1729expose a queryable API to let users know the current status of their uploaded video.
1730searchVideo(api_dev_key, search_query, user_location, maximum_videos_to_return, page_token)
1731Parameters:
1732api_dev_key (string): The API developer key of a registered account of our service.
1733search_query (string): A string containing the search terms.
1734user_location (string): Optional location of the user performing the search.
1736maximum_videos_to_return (number): Maximum number of results returned in one request.
1737page_token (string): This token will specify a page in the result set that should be returned.
1738Returns: (JSON)
1739A JSON containing information about the list of video resources matching the search query.
1740Each video resource will have a video title, a thumbnail, a video creation date, and a view
1741count.
1742streamVideo(api_dev_key, video_id, offset, codec, resolution)
1743Parameters:
1744api_dev_key (string): The API developer key of a registered account of our service.
1745video_id (string): A string to identify the video.
1746offset (number): We should be able to stream video from any offset; this offset would be a
1747time in seconds from the beginning of the video. If we support playing/pausing a video from
1748multiple devices, we will need to store the offset on the server. This will enable the users to
1749start watching a video on any device from the same point where they left off.
1750codec (string) & resolution(string): We should send the codec and resolution info in the API
1751from the client to support play/pause from multiple devices. Imagine you are watching a video
1752on your TV’s Netflix app, paused it, and started watching it on your phone’s Netflix app. In
1753this case, you would need codec and resolution, as both these devices have a different
1754resolution and use a different codec.
1755Returns: (STREAM)
1756A media stream (a video chunk) from the given offset.
17575. High Level Design
1758At a high-level we would need the following components:
17591. Processing Queue: Each uploaded video will be pushed to a processing queue to be dequeued later for encoding, thumbnail generation, and storage.
17602. Encoder: To encode each uploaded video into multiple formats.
17613. Thumbnails generator: To generate a few thumbnails for each video.
17624. Video and Thumbnail storage: To store video and thumbnail files in some distributed
1763file storage.
17645. User Database: To store user’s information, e.g., name, email, address, etc.
17656. Video metadata storage: A metadata database to store all the information about videos
1766like title, file path in the system, uploading user, total views, likes, dislikes, etc. It will
1767also be used to store all the video comments.
1769High level design of Youtube
17706. Database Schema
1771Video metadata storage - MySql
Video metadata can be stored in a SQL database. The following information should be stored
1773with each video:
1774• VideoID
1775• Title
1776• Description
1777• Size
1778• Thumbnail
1779• Uploader/User
1780• Total number of likes
1781• Total number of dislikes
1782• Total number of views
For each video comment, we need to store the following information:
1784• CommentID
1785• VideoID
1786• UserID
1787• Comment
1788• TimeOfCreation
1789User data storage - MySql
1790• UserID, Name, email, address, age, registration details etc.
17927. Detailed Component Design
1793The service would be read-heavy, so we will focus on building a system that can retrieve
1794videos quickly. We can expect our read:write ratio to be 200:1, which means for every video
1795upload there are 200 video views.
1796Where would videos be stored? Videos can be stored in a distributed file storage system
1797like HDFS or GlusterFS.
1798How should we efficiently manage read traffic? We should segregate our read traffic from
1799write traffic. Since we will have multiple copies of each video, we can distribute our read
1800traffic on different servers. For metadata, we can have master-slave configurations where
writes will go to the master first and then get applied at all the slaves. Such configurations can cause some staleness in data, e.g., when a new video is added, its metadata would be inserted in the master first; until it gets applied at the slaves, they would not be able to see it and would therefore return stale results to the user. This staleness might be acceptable
1805in our system as it would be very short-lived and the user would be able to see the new videos
1806after a few milliseconds.
1807Where would thumbnails be stored? There will be a lot more thumbnails than videos. If we
1808assume that every video will have five thumbnails, we need to have a very efficient storage
system that can serve huge read traffic. There will be two considerations before deciding
1810which storage system should be used for thumbnails:
1. Thumbnails are small files, say, a maximum of 5KB each.
18122. Read traffic for thumbnails will be huge compared to videos. Users will be watching one
1813video at a time, but they might be looking at a page that has 20 thumbnails of other
1814videos.
1815Let’s evaluate storing all the thumbnails on a disk. Given that we have a huge number of files,
1816we have to perform a lot of seeks to different locations on the disk to read these files. This is
1817quite inefficient and will result in higher latencies.
1818Bigtable can be a reasonable choice here as it combines multiple files into one block to store
1819on the disk and is very efficient in reading a small amount of data. Both of these are the two
1820most significant requirements of our service. Keeping hot thumbnails in the cache will also
1821help in improving the latencies and, given that thumbnails files are small in size, we can easily
1822cache a large number of such files in memory.
1823Video Uploads: Since videos could be huge, if while uploading the connection drops we
1824should support resuming from the same point.
1825Video Encoding: Newly uploaded videos are stored on the server and a new task is added to
the processing queue to encode the video into multiple formats. Once all the encoding is completed, the uploader will be notified and the video will be made available for viewing/sharing.
1829Detailed component design of Youtube
18308. Metadata Sharding
Since we have a huge number of new videos every day and our read load is extremely high, we need to distribute our data onto multiple machines so that we can perform
1833read/write operations efficiently. We have many options to shard our data. Let’s go through
1834different strategies of sharding this data one by one:
1835Sharding based on UserID: We can try storing all the data for a particular user on one server.
1836While storing, we can pass the UserID to our hash function which will map the user to a
1837database server where we will store all the metadata for that user’s videos. While querying for
1838videos of a user, we can ask our hash function to find the server holding the user’s data and
1839then read it from there. To search videos by titles we will have to query all servers and each
1840server will return a set of videos. A centralized server will then aggregate and rank these
1841results before returning them to the user.
1842This approach has a couple of issues:
18431. What if a user becomes popular? There could be a lot of queries on the server holding
1844that user; this could create a performance bottleneck. This will also affect the overall
1845performance of our service.
18462. Over time, some users can end up storing a lot of videos compared to others.
1847Maintaining a uniform distribution of growing user data is quite tricky.
To recover from these situations, we either have to repartition/redistribute our data or use consistent hashing to balance the load between servers.
1851Sharding based on VideoID: Our hash function will map each VideoID to a random server
1852where we will store that Video’s metadata. To find videos of a user we will query all servers
1853and each server will return a set of videos. A centralized server will aggregate and rank these
1854results before returning them to the user. This approach solves our problem of popular users
1855but shifts it to popular videos.
1856We can further improve our performance by introducing a cache to store hot videos in front of
1857the database servers.
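A minimal sketch of this scheme; the shard count, the md5-based hash, and the query_metadata_shard callable are illustrative assumptions.

import hashlib

NUM_SHARDS = 64    # assumed number of metadata shards

def shard_for_video(video_id):
    # Map each VideoID to a shard through a stable hash
    return int(hashlib.md5(video_id.encode()).hexdigest(), 16) % NUM_SHARDS

def videos_of_user(user_id, query_metadata_shard):
    # Metadata is sharded by VideoID, so a user's videos are scattered across
    # all shards: query every shard, then aggregate and rank centrally.
    results = []
    for shard in range(NUM_SHARDS):
        results.extend(query_metadata_shard(shard, user_id))
    return sorted(results, key=lambda v: v["view_count"], reverse=True)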
18589. Video Deduplication
1859With a huge number of users uploading a massive amount of video data our service will have
1860to deal with widespread video duplication. Duplicate videos often differ in aspect ratios or
1861encodings, can contain overlays or additional borders, or can be excerpts from a longer
1862original video. The proliferation of duplicate videos can have an impact on many levels:
18631. Data Storage: We could be wasting storage space by keeping multiple copies of the same
1864video.
18652. Caching: Duplicate videos would result in degraded cache efficiency by taking up space
1866that could be used for unique content.
18673. Network usage: Duplicate videos will also increase the amount of data that must be sent
1868over the network to in-network caching systems.
18694. Energy consumption: Higher storage, inefficient cache, and network usage could result
1870in energy wastage.
1871For the end user, these inefficiencies will be realized in the form of duplicate search results,
1872longer video startup times, and interrupted streaming.
For our service, deduplication makes the most sense early, i.e., when a user is uploading a video, as compared to post-processing it later to find duplicates. Inline deduplication will save us
1875a lot of resources that can be used to encode, transfer, and store the duplicate copy of the
1876video. As soon as any user starts uploading a video, our service can run video matching
1877algorithms (e.g., Block Matching, Phase Correlation, etc.) to find duplications. If we already
1878have a copy of the video being uploaded, we can either stop the upload and use the existing
copy, or continue the upload and use the newly uploaded video if it is of higher quality. If the newly uploaded video is a subpart of an existing video, or vice versa, we can intelligently
1881divide the video into smaller chunks so that we only upload the parts that are missing.
188210. Load Balancing
1883We should use Consistent Hashing among our cache servers, which will also help in balancing
1884the load between cache servers. Since we will be using a static hash-based scheme to map
1885videos to hostnames it can lead to an uneven load on the logical replicas due to the different
1886popularity of each video. For instance, if a video becomes popular, the logical replica
1887corresponding to that video will experience more traffic than other servers. These uneven
1888loads for logical replicas can then translate into uneven load distribution on corresponding
1890physical servers. To resolve this issue any busy server in one location can redirect a client to a
1891less busy server in the same cache location. We can use dynamic HTTP redirections for this
1892scenario.
1893However, the use of redirections also has its drawbacks. First, since our service tries to load
1894balance locally, it leads to multiple redirections if the host that receives the redirection can’t
1895serve the video. Also, each redirection requires a client to make an additional HTTP request; it
1896also leads to higher delays before the video starts playing back. Moreover, inter-tier (or cross
1897data-center) redirections lead a client to a distant cache location because the higher tier
1898caches are only present at a small number of locations.
189911. Cache
1900To serve globally distributed users, our service needs a massive-scale video delivery system.
1901Our service should push its content closer to the user using a large number of geographically
1902distributed video cache servers. We need to have a strategy that will maximize user
1903performance and also evenly distributes the load on its cache servers.
We can introduce a cache for metadata servers to cache hot database rows. We can use Memcache to cache the data, and application servers can quickly check if the cache has the desired rows before hitting the database. Least Recently Used (LRU) can be a reasonable cache eviction
1907policy for our system. Under this policy, we discard the least recently viewed row first.
How can we build a more intelligent cache? If we go with the 80-20 rule, i.e., 20% of the daily read volume for videos generates 80% of the traffic, meaning that certain videos are so popular that the majority of people view them, it follows that we can try caching 20% of the daily read volume of videos and metadata.
191212. Content Delivery Network (CDN)
A CDN is a system of distributed servers that deliver web content to a user based on the geographic location of the user, the origin of the web page, and a content delivery server. Take
1915a look at ‘CDN’ section in our Caching chapter.
1916Our service can move popular videos to CDNs:
1917• CDNs replicate content in multiple places. There’s a better chance of videos being closer
1918to the user and, with fewer hops, videos will stream from a friendlier network.
1919• CDN machines make heavy use of caching and can mostly serve videos out of memory.
1920Less popular videos (1-20 views per day) that are not cached by CDNs can be served by our
1921servers in various data centers.
192313. Fault Tolerance
1924We should use Consistent Hashing for distribution among database servers. Consistent
1925hashing will not only help in replacing a dead server, but also help in distributing load among
1926servers.
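A minimal consistent hashing sketch, assuming a fixed list of server names and a handful of virtual nodes per server; a real deployment would add replication and rebalancing on top of this.

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        self.ring = []                                  # sorted list of (hash, server)
        for server in servers:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_server(self, key):
        # Walk clockwise to the first virtual node at or after hash(key)
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["db1", "db2", "db3"])
print(ring.get_server("video:12345"))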
1928Designing Typeahead Suggestion
1929Let's design a real-time suggestion service, which will recommend terms to users as they
1930enter text for searching. Similar Services: Auto-suggestions, Typeahead search Difficulty:
1931Medium
19321. What is Typeahead Suggestion?
1933Typeahead suggestions enable users to search for known and frequently searched terms. As
1934the user types into the search box, it tries to predict the query based on the characters the user
1935has entered and gives a list of suggestions to complete the query. Typeahead suggestions help
1936the user to articulate their search queries better. It’s not about speeding up the search process
1937but rather about guiding the users and lending them a helping hand in constructing their
1938search query.
19392. Requirements and Goals of the System
1940Functional Requirements: As the user types in their query, our service should suggest top 10
1941terms starting with whatever the user has typed.
Non-functional Requirements: The suggestions should appear in real-time. The user should be
1943able to see the suggestions within 200ms.
19443. Basic System Design and Algorithm
1945The problem we are solving is that we have a lot of ‘strings’ that we need to store in such a way
1946that users can search with any prefix. Our service will suggest next terms that will match the
1947given prefix. For example, if our database contains the following terms: cap, cat, captain, or
1948capital and the user has typed in ‘cap’, our system should suggest ‘cap’, ‘captain’ and ‘capital’.
1949Since we’ve got to serve a lot of queries with minimum latency, we need to come up with a
1950scheme that can efficiently store our data such that it can be queried quickly. We can’t depend
1951upon some database for this; we need to store our index in memory in a highly efficient data
1952structure.
1953One of the most appropriate data structures that can serve our purpose is the Trie
1954(pronounced “try”). A trie is a tree-like data structure used to store phrases where each node
1955stores a character of the phrase in a sequential manner. For example, if we need to store ‘cap,
1956cat, caption, captain, capital’ in the trie, it would look like:
1958Now if the user has typed ‘cap’, our service can traverse the trie to go to the node ‘P’ to find all
1959the terms that start with this prefix (e.g., cap-tion, cap-ital etc).
1960We can merge nodes that have only one branch to save storage space. The above trie can be
1961stored like this:
Should we have a case insensitive trie? For simplicity and the search use-case, let's assume our
1964data is case insensitive.
1965How to find top suggestion? Now that we can find all the terms for a given prefix, how can we
1966find the top 10 terms for the given prefix? One simple solution could be to store the count of
1967searches that terminated at each node, e.g., if users have searched about ‘CAPTAIN’ 100 times
1968and ‘CAPTION’ 500 times, we can store this number with the last character of the phrase.
1969Now if the user types ‘CAP’ we know the top most searched word under the prefix ‘CAP’ is
1970‘CAPTION’. So, to find the top suggestions for a given prefix, we can traverse the sub-tree
1971under it.
1972Given a prefix, how much time will it take to traverse its sub-tree? Given the amount of
1973data we need to index, we should expect a huge tree. Even traversing a sub-tree would take
1974really long, e.g., the phrase ‘system design interview questions’ is 30 levels deep. Since we
1975have very strict latency requirements we do need to improve the efficiency of our solution.
1976Can we store top suggestions with each node? This can surely speed up our searches but will
1977require a lot of extra storage. We can store top 10 suggestions at each node that we can return
1978to the user. We have to bear the big increase in our storage capacity to achieve the required
1979efficiency.
1980We can optimize our storage by storing only references of the terminal nodes rather than
1981storing the entire phrase. To find the suggested terms we need to traverse back using the
1982parent reference from the terminal node. We will also need to store the frequency with each
1983reference to keep track of top suggestions.
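A minimal trie sketch along these lines; storing full (count, term) pairs at every node instead of terminal-node references is a simplifying assumption made purely for readability.

class TrieNode:
    def __init__(self):
        self.children = {}      # character -> TrieNode
        self.top = []           # up to TOP_K (count, term) pairs for this prefix

class SuggestionTrie:
    TOP_K = 10

    def __init__(self):
        self.root = TrieNode()

    def add_term(self, term, count):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
            # keep the most searched terms cached at every node along the path
            node.top = sorted(node.top + [(count, term)], reverse=True)[: self.TOP_K]

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [term for _, term in node.top]

trie = SuggestionTrie()
trie.add_term("caption", 500)
trie.add_term("captain", 100)
print(trie.suggest("cap"))      # ['caption', 'captain']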
1985How would we build this trie? We can efficiently build our trie bottom up. Each parent node
1986will recursively call all the child nodes to calculate their top suggestions and their counts.
1987Parent nodes will combine top suggestions from all of their children to determine their top
1988suggestions.
How to update the trie? Assuming five billion searches every day, we would get approximately 60K queries per second. If we try to update our trie for every query, it will be extremely resource intensive and can hamper our read requests, too. One solution to
1992handle this could be to update our trie offline after a certain interval.
1993As the new queries come in we can log them and also track their frequencies. Either we can
1994log every query or do sampling and log every 1000th query. For example, if we don’t want to
1995show a term which is searched for less than 1000 times, it’s safe to log every 1000th searched
1996term.
We can have a Map-Reduce (MR) set-up to process all the logging data periodically, say every
1998hour. These MR jobs will calculate frequencies of all searched terms in the past hour. We can
1999then update our trie with this new data. We can take the current snapshot of the trie and
2000update it with all the new terms and their frequencies. We should do this offline as we don’t
2001want our read queries to be blocked by update trie requests. We can have two options:
20021. We can make a copy of the trie on each server to update it offline. Once done we can
2003switch to start using it and discard the old one.
20042. Another option is we can have a master-slave configuration for each trie server. We can
2005update slave while the master is serving traffic. Once the update is complete, we can
2006make the slave our new master. We can later update our old master, which can then
2007start serving traffic, too.
2008How can we update the frequencies of typeahead suggestions? Since we are storing
2009frequencies of our typeahead suggestions with each node, we need to update them too! We
2010can update only differences in frequencies rather than recounting all search terms from
scratch. If we're keeping a count of all the terms searched in the last 10 days, we'll need to subtract
2012the counts from the time period no longer included and add the counts for the new time
2013period being included. We can add and subtract frequencies based on Exponential Moving
2014Average (EMA) of each term. In EMA, we give more weight to the latest data. It’s also known
2015as the exponentially weighted moving average.
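For illustration, a single EMA update for one term could be as simple as the following; the smoothing factor alpha is an assumed tuning parameter.

def update_ema(previous_ema, latest_count, alpha=0.3):
    # Give more weight to the most recent counting window
    return alpha * latest_count + (1 - alpha) * previous_ema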
2016After inserting a new term in the trie, we’ll go to the terminal node of the phrase and increase
2017its frequency. Since we’re storing the top 10 queries in each node, it is possible that this
2018particular search term jumped into the top 10 queries of a few other nodes. So, we need to
2019update the top 10 queries of those nodes then. We have to traverse back from the node to all
2020the way up to the root. For every parent, we check if the current query is part of the top 10. If
2021so, we update the corresponding frequency. If not, we check if the current query’s frequency is
2022high enough to be a part of the top 10. If so, we insert this new term and remove the term with
2023the lowest frequency.
How can we remove a term from the trie? Let's say we have to remove a term from the trie because of a legal issue, hate speech, piracy, etc. We can completely remove such terms from the trie when the regular update happens; meanwhile, we can add a filtering layer on each server which will remove any such term before sending suggestions to users.
2029What could be different ranking criteria for suggestions? In addition to a simple count, for
2030terms ranking, we have to consider other factors too, e.g., freshness, user location, language,
2031demographics, personal history etc.
20324. Permanent Storage of the Trie
How do we store the trie in a file so that we can rebuild it easily? This will be needed when a machine restarts. We can take a snapshot of our trie periodically and store it in a file. This
will enable us to rebuild the trie if the server goes down. To store, we can start with the root node and save the trie node by node in a depth-first manner. With each node, we can store what character it contains and how many children it has. Right after each node, we put its children's sub-trees. Let's
2038assume we have the following trie:
2039If we store this trie in a file with the above-mentioned scheme, we will have:
2040“C2,A2,R1,T,P,O1,D”. From this, we can easily rebuild our trie.
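A minimal sketch of this serialization; the Node layout is an assumption, and the example trie below is reconstructed from the string above (each node is written before its children's sub-trees, with its child count appended).

class Node:
    def __init__(self, children=None):
        self.children = children or {}      # character -> Node

def serialize(node):
    # Each entry: the node's character plus its child count (blank when zero),
    # immediately followed by the serialization of its children's sub-trees.
    parts = []
    for ch, child in node.children.items():
        parts.append(f"{ch}{len(child.children) or ''}")
        parts.extend(serialize(child))
    return parts

# The trie behind the example string: C -> {A -> {R -> T, P}, O -> D}
root = Node({"C": Node({"A": Node({"R": Node({"T": Node()}), "P": Node()}),
                        "O": Node({"D": Node()})})})
print(",".join(serialize(root)))            # C2,A2,R1,T,P,O1,D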
2041If you’ve noticed, we are not storing top suggestions and their counts with each node. It is
2042hard to store this information; as our trie is being stored top down, we don’t have child nodes
2043created before the parent, so there is no easy way to store their references. For this, we have to
2044recalculate all the top terms with counts. This can be done while we are building the trie. Each
2045node will calculate its top suggestions and pass it to its parent. Each parent node will merge
2046results from all of its children to figure out its top suggestions.
20485. Scale Estimation
2049If we are building a service that has the same scale as that of Google we can expect 5 billion
2050searches every day, which would give us approximately 60K queries per second.
2051Since there will be a lot of duplicates in 5 billion queries, we can assume that only 20% of
2052these will be unique. If we only want to index the top 50% of the search terms, we can get rid
2053of a lot of less frequently searched queries. Let’s assume we will have 100 million unique
2054terms for which we want to build an index.
2055Storage Estimation: If on the average each query consists of 3 words and if the average length
2056of a word is 5 characters, this will give us 15 characters of average query size. Assuming we
2057need 2 bytes to store a character, we will need 30 bytes to store an average query. So total
2058storage we will need:
2059100 million * 30 bytes => 3 GB
2060We can expect some growth in this data every day, but we should also be removing some
2061terms that are not searched anymore. If we assume we have 2% new queries every day and if
2062we are maintaining our index for the last one year, total storage we should expect:
20633GB + (0.02 * 3 GB * 365 days) => 25 GB
20646. Data Partition
2065Although our index can easily fit on one server, we can still partition it in order to meet our
2066requirements of higher efficiency and lower latencies. How can we efficiently partition our
2067data to distribute it onto multiple servers?
a. Range Based Partitioning: What if we store our phrases in separate partitions based on their first letter? So we save all the terms starting with the letter 'A' in one partition and those
2070that start with the letter ‘B’ into another partition and so on. We can even combine certain less
2071frequently occurring letters into one partition. We should come up with this partitioning
2072scheme statically so that we can always store and search terms in a predictable manner.
2073The main problem with this approach is that it can lead to unbalanced servers, for instance, if
we decide to put all terms starting with the letter 'E' into one partition, but later we realize that we have so many terms starting with the letter 'E' that they can't fit into one partition.
2076We can see that the above problem will happen with every statically defined scheme. It is not
2077possible to calculate if each of our partitions will fit on one server statically.
2078b. Partition based on the maximum capacity of the server: Let’s say we partition our trie
2079based on the maximum memory capacity of the servers. We can keep storing data on a server
2080as long as it has memory available. Whenever a sub-tree cannot fit into a server, we break our
partition there to assign that range to this server and move on to the next server to repeat this process. Let's say our first trie server can store all terms from 'A' to 'AABC', which means our next server will store from 'AABD' onwards. If our second server could store up to 'BXA', the
2085next server will start from ‘BXB’, and so on. We can keep a hash table to quickly access this
2086partitioning scheme:
2087Server 1, A-AABC
2088Server 2, AABD-BXA
2089Server 3, BXB-CDA
2090For querying, if the user has typed ‘A’ we have to query both server 1 and 2 to find the top
2091suggestions. When the user has typed ‘AA’, we still have to query server 1 and 2, but when the
2092user has typed ‘AAA’ we only need to query server 1.
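A minimal sketch of this prefix-to-server lookup, using the example partition map above; merging the results returned by each server is omitted.

PARTITIONS = [                      # (first_term, last_term, server), kept in term order
    ("A",    "AABC", "server1"),
    ("AABD", "BXA",  "server2"),
    ("BXB",  "CDA",  "server3"),
]

def servers_for_prefix(prefix):
    # A partition can hold matching terms if its [first, last] range overlaps
    # the lexicographic interval covered by the prefix.
    upper = prefix + "\uffff"
    return [server for first, last, server in PARTITIONS
            if first <= upper and last >= prefix]

print(servers_for_prefix("A"))      # ['server1', 'server2']
print(servers_for_prefix("AAA"))    # ['server1']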
2093We can have a load balancer in front of our trie servers which can store this mapping and
2094redirect traffic. Also, if we are querying from multiple servers, either we need to merge the
2095results on the server side to calculate the overall top results or make our clients do that. If we
2096prefer to do this on the server side, we need to introduce another layer of servers between load
balancers and trie servers (let's call them aggregators). These servers will aggregate results from
2098multiple trie servers and return the top results to the client.
2099Partitioning based on the maximum capacity can still lead us to hotspots, e.g., if there are a lot
2100of queries for terms starting with ‘cap’, the server holding it will have a high load compared to
2101others.
2102c. Partition based on the hash of the term: Each term will be passed to a hash function,
2103which will generate a server number and we will store the term on that server. This will make
2104our term distribution random and hence minimize hotspots. The disadvantage of this scheme
2105is, to find typeahead suggestions for a term we have to ask all the servers and then aggregate
2106the results.
21077. Cache
2108We should realize that caching the top searched terms will be extremely helpful in our service.
2109There will be a small percentage of queries that will be responsible for most of the traffic. We
2110can have separate cache servers in front of the trie servers holding most frequently searched
2111terms and their typeahead suggestions. Application servers should check these cache servers
before hitting the trie servers to see if they have the desired searched terms. This will save us the time needed to traverse the trie.
2114We can also build a simple Machine Learning (ML) model that can try to predict the
2115engagement on each suggestion based on simple counting, personalization, or trending data,
2116and cache these terms beforehand.
21178. Replication and Load Balancer
2118We should have replicas for our trie servers both for load balancing and also for fault
2119tolerance. We also need a load balancer that keeps track of our data partitioning scheme and
2120redirects traffic based on the prefixes.
21229. Fault Tolerance
What will happen when a trie server goes down? As discussed above, we can have a master-slave configuration; if the master dies, the slave can take over after failover. Any server that
2124comes back up, can rebuild the trie based on the last snapshot.
212510. Typeahead Client
2126We can perform the following optimizations on the client side to improve user’s experience:
21271. The client should only try hitting the server if the user has not pressed any key for
212850ms.
21292. If the user is constantly typing, the client can cancel the in-progress requests.
21303. Initially, the client can wait until the user enters a couple of characters.
21314. Clients can pre-fetch some data from the server to save future requests.
21325. Clients can store the recent history of suggestions locally. Recent history has a very high
2133rate of being reused.
21346. Establishing an early connection with the server turns out to be one of the most
2135important factors. As soon as the user opens the search engine website, the client can
2136open a connection with the server. So when a user types in the first character, the client
2137doesn’t waste time in establishing the connection.
21387. The server can push some part of their cache to CDNs and Internet Service Providers
2139(ISPs) for efficiency.
214011. Personalization
2141Users will receive some typeahead suggestions based on their historical searches, location,
2142language, etc. We can store the personal history of each user separately on the server and also
2143cache them on the client. The server can add these personalized terms in the final set before
2144sending it to the user. Personalized searches should always come before others.
2146Designing an API Rate Limiter
Let's design an API Rate Limiter which will throttle users based upon the number of requests they are sending. Difficulty Level: Medium
21491. What is a Rate Limiter?
2150Imagine we have a service which is receiving a huge number of requests, but it can only serve
2151a limited number of requests per second. To handle this problem we would need some kind of
2152throttling or rate limiting mechanism that would allow only a certain number of requests so
2153our service can respond to all of them. A rate limiter, at a high-level, limits the number of
2154events an entity (user, device, IP, etc.) can perform in a particular time window. For example:
2155• A user can send only one message per second.
2156• A user is allowed only three failed credit card transactions per day.
2157• A single IP can only create twenty accounts per day.
2158In general, a rate limiter caps how many requests a sender can issue in a specific time
2159window. It then blocks requests once the cap is reached.
21602. Why do we need API rate limiting?
2161Rate Limiting helps to protect services against abusive behaviors targeting the application
2162layer like Denial-of-service (DOS) attacks, brute-force password attempts, brute-force credit
2163card transactions, etc. These attacks are usually a barrage of HTTP/S requests which may look
2164like they are coming from real users, but are typically generated by machines (or bots). As a
2165result, these attacks are often harder to detect and can more easily bring down a service,
2166application, or an API.
2167Rate limiting is also used to prevent revenue loss, to reduce infrastructure costs, to stop spam,
2168and to stop online harassment. Following is a list of scenarios that can benefit from Rate
2169limiting by making a service (or API) more reliable:
2170• Misbehaving clients/scripts: Either intentionally or unintentionally, some entities can
2171overwhelm a service by sending a large number of requests. Another scenario could be
2172when a user is sending a lot of lower-priority requests and we want to make sure that it
2173doesn’t affect the high-priority traffic. For example, users sending a high volume of
2174requests for analytics data should not be allowed to hamper critical transactions for
2175other users.
2176• Security: By limiting the number of the second-factor attempts (in 2-factor auth) that
2177the users are allowed to perform, for example, the number of times they’re allowed to
2178try with a wrong password.
2179• To prevent abusive behavior and bad design practices: Without API limits,
2180developers of client applications would use sloppy development tactics, for example,
2181requesting the same information over and over again.
2183• To keep costs and resource usage under control: Services are generally designed for
2184normal input behavior, for example, a user writing a single post in a minute. Computers
2185could easily push thousands/second through an API. Rate limiter enables controls on
2186service APIs.
2187• Revenue: Certain services might want to limit operations based on the tier of their
2188customer’s service and thus create a revenue model based on rate limiting. There could
be default limits for all the APIs a service offers. To go beyond that, the user has to buy higher limits.
2191• To eliminate spikiness in traffic: Make sure the service stays up for everyone else.
21923. Requirements and Goals of the System
2193Our Rate Limiter should meet the following requirements:
2194Functional Requirements:
21951. Limit the number of requests an entity can send to an API within a time window, e.g.,
219615 requests per second.
21972. The APIs are accessible through a cluster, so the rate limit should be considered across
2198different servers. The user should get an error message whenever the defined threshold
2199is crossed within a single server or across a combination of servers.
2200Non-Functional Requirements:
22011. The system should be highly available. The rate limiter should always work since it
2202protects our service from external attacks.
22032. Our rate limiter should not introduce substantial latencies affecting the user experience.
22044. How to do Rate Limiting?
2205Rate Limiting is a process that is used to define the rate and speed at which consumers can
2206access APIs. Throttling is the process of controlling the usage of the APIs by customers during
2207a given period. Throttling can be defined at the application level and/or API level. When a
2208throttle limit is crossed, the server returns HTTP status “429 - Too many requests".
22095. What are different types of throttling?
2210Here are the three famous throttling types that are used by different services:
2211Hard Throttling: The number of API requests cannot exceed the throttle limit.
2212Soft Throttling: In this type, we can set the API request limit to exceed a certain percentage.
2213For example, if we have rate-limit of 100 messages a minute and 10% exceed-limit, our rate
2214limiter will allow up to 110 messages per minute.
2216Elastic or Dynamic Throttling: Under Elastic throttling, the number of requests can go
2217beyond the threshold if the system has some resources available. For example, if a user is
2218allowed only 100 messages a minute, we can let the user send more than 100 messages a
2219minute when there are free resources available in the system.
22206. What are different types of algorithms used for Rate Limiting?
2221Following are the two types of algorithms used for Rate Limiting:
2222Fixed Window Algorithm: In this algorithm, the time window is considered from the start of
2223the time-unit to the end of the time-unit. For example, a period would be considered 0-60
2224seconds for a minute irrespective of the time frame at which the API request has been made.
2225In the diagram below, there are two messages between 0-1 second and three messages
2226between 1-2 seconds. If we have a rate limiting of two messages a second, this algorithm will
2227throttle only ‘m5’.
2228Rolling Window Algorithm: In this algorithm, the time window is considered from the
2229fraction of the time at which the request is made plus the time window length. For example, if
2230there are two messages sent at the 300th millisecond and 400th millisecond of a second, we’ll
2231count them as two messages from the 300th millisecond of that second up to the
2232300th millisecond of next second. In the above diagram, keeping two messages a second, we’ll
2233throttle ‘m3’ and ‘m4’.
22347. High level design for Rate Limiter
2235Rate Limiter will be responsible for deciding which request will be served by the API servers
2236and which request will be declined. Once a new request arrives, the Web Server first asks the
2237Rate Limiter to decide if it will be served or throttled. If the request is not throttled, then it’ll
2238be passed to the API servers.
2240High level design for Rate Limiter
22418. Basic System Design and Algorithm
2242Let’s take the example where we want to limit the number of requests per user. Under this
2243scenario, for each unique user, we would keep a count representing how many requests the
2244user has made and a timestamp when we started counting the requests. We can keep it in a
2245hashtable, where the ‘key’ would be the ‘UserID’ and ‘value’ would be a structure containing
2246an integer for the ‘Count’ and an integer for the Epoch time:
Let's assume our rate limiter is allowing three requests per minute per user, so whenever a new request comes in, our rate limiter will perform the following steps (a sketch follows the list):
22491. If the ‘UserID’ is not present in the hash-table, insert it, set the ‘Count’ to 1, set
2250‘StartTime’ to the current time (normalized to a minute), and allow the request.
22512. Otherwise, find the record of the ‘UserID’ and if CurrentTime – StartTime >= 1 min, set
2252the ‘StartTime’ to the current time, ‘Count’ to 1, and allow the request.
22533. If CurrentTime - StartTime <= 1 min and
2254o If ‘Count < 3’, increment the Count and allow the request.
2255o If ‘Count >= 3’, reject the request.
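A minimal single-process sketch of these steps (no locking or distributed storage; the limit and window are the example values above).

import time

REQUESTS_PER_MINUTE = 3
buckets = {}    # user_id -> {"count": int, "start": int}  (start is an epoch minute)

def allow_request(user_id):
    current_minute = int(time.time()) // 60
    record = buckets.get(user_id)
    if record is None or current_minute > record["start"]:
        # new user, or the fixed one-minute window has rolled over
        buckets[user_id] = {"count": 1, "start": current_minute}
        return True
    if record["count"] < REQUESTS_PER_MINUTE:
        record["count"] += 1
        return True
    return False    # throttled: over the limit within this window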
2257What are some of the problems with our algorithm?
22581. This is a Fixed Window algorithm since we’re resetting the ‘StartTime’ at the end of
2259every minute, which means it can potentially allow twice the number of requests per
2260minute. Imagine if Kristie sends three requests at the last second of a minute, then she
2261can immediately send three more requests at the very first second of the next minute,
2262resulting in 6 requests in the span of two seconds. The solution to this problem would
2263be a sliding window algorithm which we’ll discuss later.
22642. Atomicity: In a distributed environment, the “read-and-then-write” behavior can create
2265a race condition. Imagine if Kristie’s current ‘Count’ is “2” and that she issues two more
2266requests. If two separate processes served each of these requests and concurrently read
2267the Count before either of them updated it, each process would think that Kristie could
2268have one more request and that she had not hit the rate limit.
If we are using Redis to store our key-value data, one solution to resolve the atomicity problem is to use a Redis lock for the duration of the read-update operation. This, however, would come at
2271the expense of slowing down concurrent requests from the same user and introducing another
2272layer of complexity. We can use Memcached, but it would have comparable complications.
2273If we are using a simple hash-table, we can have a custom implementation for ‘locking’ each
2274record to solve our atomicity problems.
2275How much memory would we need to store all of the user data? Let’s assume the simple
2276solution where we are keeping all of the data in a hash-table.
2277Let’s assume ‘UserID’ takes 8 bytes. Let’s also assume a 2 byte ‘Count’, which can count up to
227865k, is sufficient for our use case. Although epoch time will need 4 bytes, we can choose to
2279store only the minute and second part, which can fit into 2 bytes. Hence, we need a total of 12
2280bytes to store a user’s data:
22818 + 2 + 2 = 12 bytes
2282Let’s assume our hash-table has an overhead of 20 bytes for each record. If we need to track
2283one million users at any time, the total memory we would need would be 32MB:
2284(12 + 20) bytes * 1 million => 32MB
2285If we assume that we would need a 4-byte number to lock each user’s record to resolve our
2286atomicity problems, we would require a total 36MB memory.
This can easily fit on a single server; however, we would not like to route all of our traffic through a single machine. Also, if we assume a rate limit of 10 requests per second, this would
2290translate into 10 million QPS for our rate limiter! This would be too much for a single server.
2291Practically, we can assume we would use a Redis or Memcached kind of a solution in a
2292distributed setup. We’ll be storing all the data in the remote Redis servers and all the Rate
2293Limiter servers will read (and update) these servers before serving or throttling any request.
22949. Sliding Window algorithm
We can maintain a sliding window if we can keep track of each request per user. We can store the timestamp of each request in a Redis Sorted Set in the 'value' field of our hash-table.
Let's assume our rate limiter is allowing three requests per minute per user, so, whenever a new request comes in, the Rate Limiter will perform the following steps (see the sketch after the list):
22991. Remove all the timestamps from the Sorted Set that are older than “CurrentTime - 1
2300minute”.
23012. Count the total number of elements in the sorted set. Reject the request if this count is
2302greater than our throttling limit of “3”.
23033. Insert the current time in the sorted set and accept the request.
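A sketch of these steps using the redis-py client; the key naming is an assumption, and a production version would wrap the steps in a Lua script or MULTI/EXEC to keep them atomic.

import time
import redis

r = redis.Redis()
WINDOW_SECONDS = 60
LIMIT = 3

def allow_request(user_id):
    key = f"rate:{user_id}"                         # assumed key naming scheme
    now = time.time()
    # 1. Drop timestamps that fell out of the sliding one-minute window
    r.zremrangebyscore(key, 0, now - WINDOW_SECONDS)
    # 2. Reject if the user already has LIMIT requests in the window
    if r.zcard(key) >= LIMIT:
        return False
    # 3. Record this request and accept it
    r.zadd(key, {str(now): now})
    r.expire(key, WINDOW_SECONDS)
    return True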
2304How much memory would we need to store all of the user data for sliding window? Let’s
2305assume ‘UserID’ takes 8 bytes. Each epoch time will require 4 bytes. Let’s suppose we need a
2306rate limiting of 500 requests per hour. Let’s assume 20 bytes overhead for hash-table and 20
2307bytes overhead for the Sorted Set. At max, we would need a total of 12KB to store one user’s
2308data:
23098 + (4 + 20 (sorted set overhead)) * 500 + 20 (hash-table overhead) = 12KB
2310Here we are reserving 20 bytes overhead per element. In a sorted set, we can assume that we
2311need at least two pointers to maintain order among elements — one pointer to the previous
2312element and one to the next element. On a 64bit machine, each pointer will cost 8 bytes. So
2313we will need 16 bytes for pointers. We added an extra word (4 bytes) for storing other
2314overhead.
2315If we need to track one million users at any time, total memory we would need would be
231612GB:
231712KB * 1 million ~= 12GB
2318Sliding Window Algorithm takes a lot of memory compared to the Fixed Window; this would
2319be a scalability issue. What if we can combine the above two algorithms to optimize our
2320memory usage?
232210. Sliding Window with Counters
What if we keep track of request counts for each user using multiple fixed time windows, e.g., 1/60th the size of our rate limit's time window? For example, if we have an hourly rate limit, we
2325can keep a count for each minute and calculate the sum of all counters in the past hour when
2326we receive a new request to calculate the throttling limit. This would reduce our memory
2327footprint. Let’s take an example where we rate-limit at 500 requests per hour with an
2328additional limit of 10 requests per minute. This means that when the sum of the counters with
2329timestamps in the past hour exceeds the request threshold (500), Kristie has exceeded the
2330rate limit. In addition to that, she can’t send more than ten requests per minute. This would
2331be a reasonable and practical consideration, as none of the real users would send frequent
2332requests. Even if they do, they will see success with retries since their limits get reset every
2333minute.
2334We can store our counters in a Redis Hash since it offers incredibly efficient storage for fewer
2335than 100 keys. When each request increments a counter in the hash, it also sets the hash
2336to expire an hour later. We will normalize each ‘time’ to a minute.
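A sketch of this counter scheme with redis-py; the key format is an assumption, the additional per-minute limit is omitted, and stale minute fields are left to expire with the hash rather than being pruned explicitly.

import time
import redis

r = redis.Redis()
HOURLY_LIMIT = 500

def allow_request(user_id):
    key = f"rate:counters:{user_id}"               # assumed key naming scheme
    minute = int(time.time()) // 60                # normalize each 'time' to a minute
    counters = r.hgetall(key)                      # {minute -> count} for this user
    in_window = sum(int(count) for ts, count in counters.items()
                    if int(ts) > minute - 60)      # sum the counters from the past hour
    if in_window >= HOURLY_LIMIT:
        return False
    r.hincrby(key, minute, 1)
    r.expire(key, 3600)                            # the hash expires an hour later
    return True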
How much memory would we need to store all the user data for the sliding window with counters? Let's assume 'UserID' takes 8 bytes. Each epoch time will need 4 bytes, and the
2339Counter would need 2 bytes. Let’s suppose we need a rate limiting of 500 requests per hour.
2340Assume 20 bytes overhead for hash-table and 20 bytes for Redis hash. Since we’ll keep a
2341count for each minute, at max, we would need 60 entries for each user. We would need a total
2342of 1.6KB to store one user’s data:
23438 + (4 + 2 + 20 (Redis hash overhead)) * 60 + 20 (hash-table overhead) = 1.6KB
2344If we need to track one million users at any time, total memory we would need would be
23451.6GB:
23461.6KB * 1 million ~= 1.6GB
2347So, our ‘Sliding Window with Counters’ algorithm uses 86% less memory than the simple
2348sliding window algorithm.
234911. Data Sharding and Caching
We can shard based on 'UserID' to distribute the users' data. For fault tolerance and replication, we should use Consistent Hashing. If we want to have different throttling limits for different APIs, we can choose to shard per user per API. Take the example of a URL Shortener: we can have different rate limiters for the createURL() and deleteURL() APIs for each user or IP.
2354If our APIs are partitioned, a practical consideration could be to have a separate (somewhat
2355smaller) rate limiter for each API shard as well. Let’s take the example of our URL Shortener
2356where we want to limit each user not to create more than 100 short URLs per hour. Assuming
2357we are using Hash-Based Partitioning for our createURL() API, we can rate limit each
partition to allow a user to create no more than three short URLs per minute, in addition to 100 short URLs per hour.
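To make 'shard per user per API' concrete, here is a toy consistent-hash ring (hand-rolled for illustration, not a specific library) that maps a 'UserID:API' key to a rate-limiter shard; the node names and virtual-node count are assumptions:

import bisect
import hashlib

class ConsistentHashRing:
    """Toy ring; a real deployment would add replication and weighted nodes."""
    def __init__(self, nodes, vnodes=100):
        self.ring = []                                   # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):                      # virtual nodes smooth the distribution
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["limiter-1", "limiter-2", "limiter-3"])
shard = ring.get_node("user42:createURL")   # the shard that keeps this user's createURL counters

Because the same user-and-API key always hashes to the same shard, that shard can keep the counters locally, and adding or removing a limiter node only remaps a small fraction of keys.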
Our system can benefit hugely from caching recently active users. Application servers can quickly check if the cache has the desired record before hitting the backend servers. Our rate limiter can significantly benefit from a Write-back cache, updating all counters and timestamps in the cache only; writes to permanent storage can be done at fixed intervals. This way we can ensure minimum latency is added to the user's requests by the rate limiter. Reads can always hit the cache first, which will be extremely useful once the user has hit their maximum limit and the rate limiter is only reading data without making any updates.
2368Least Recently Used (LRU) can be a reasonable cache eviction policy for our system.
236912. Should we rate limit by IP or by user?
2370Let’s discuss the pros and cons of using each one of these schemes:
IP: In this scheme, we throttle requests per IP; although it's not optimal in terms of differentiating between 'good' and 'bad' actors, it's still better than having no rate limiting at all. The biggest problem with IP-based throttling is when multiple users share a single public IP, as in an internet cafe or when smartphone users are behind the same gateway: one bad user can cause throttling for all the others. Another issue arises when caching IP-based limits: since a huge number of IPv6 addresses are available to a hacker from even a single computer, it's trivial to make a server run out of memory tracking IPv6 addresses!
User: Rate limiting can be done on APIs after user authentication. Once authenticated, the user will be provided with a token which the user will pass with each request; this ensures that we rate limit requests carrying a valid authentication token against a particular user. But what if we have to rate limit on the login API itself? The weakness of this scheme is that a hacker can perform a denial-of-service attack against a user by entering wrong credentials up to the limit; after that, the actual user will not be able to log in.
2384How about if we combine the above two schemes?
Hybrid: The right approach could be to do both per-IP and per-user rate limiting, since each has weaknesses when implemented alone. This will, however, result in more cache entries with more details per entry, and hence require more memory and storage.
2389Designing Twitter Search
Twitter is one of the largest social networking services where users can share photos, news,
2391and text-based messages. In this chapter, we will design a service that can store and search
2392user tweets. Similar Problems: Tweet search. Difficulty Level: Medium
23931. What is Twitter Search?
2394Twitter users can update their status whenever they like. Each status (called tweet) consists of
2395plain text and our goal is to design a system that allows searching over all the user tweets.
23962. Requirements and Goals of the System
2397• Let’s assume Twitter has 1.5 billion total users with 800 million daily active users.
2398• On average Twitter gets 400 million tweets every day.
2399• The average size of a tweet is 300 bytes.
2400• Let’s assume there will be 500M searches every day.
2401• The search query will consist of multiple words combined with AND/OR.
2402We need to design a system that can efficiently store and query tweets.
24033. Capacity Estimation and Constraints
2404Storage Capacity: Since we have 400 million new tweets every day and each tweet on average
is 300 bytes, the total storage we need will be:
2406400M * 300 => 120GB/day
New storage ingested per second:
2408120GB / 24hours / 3600sec ~= 1.38MB/second
24094. System APIs
We can have SOAP or REST APIs to expose the functionality of our service; the following could be the definition of the search API:
2412search(api_dev_key, search_terms, maximum_results_to_return, sort, page_token)
2413Parameters:
2414api_dev_key (string): The API developer key of a registered account. This will be used to,
2415among other things, throttle users based on their allocated quota.
2416search_terms (string): A string containing the search terms.
2417maximum_results_to_return (number): Number of tweets to return.
2418sort (number): Optional sort mode: Latest first (0 - default), Best matched (1), Most liked (2).
2419page_token (string): This token will specify a page in the result set that should be returned.
2421Returns: (JSON)
2422A JSON containing information about a list of tweets matching the search query. Each result
2423entry can have the user ID & name, tweet text, tweet ID, creation time, number of likes, etc.
24245. High Level Design
At the high level, we need to store all the statuses in a database and also build an index that can
2426keep track of which word appears in which tweet. This index will help us quickly find tweets
2427that users are trying to search.
2428High level design for Twitter search
24296. Detailed Component Design
1. Storage: We need to store 120GB of new data every day. Given this huge amount of data, we need to come up with a data partitioning scheme that efficiently distributes the data onto multiple servers. If we plan for the next five years, we will need the following storage:
2433120GB * 365days * 5years ~= 200TB
2434If we never want to be more than 80% full at any time, we approximately will need 250TB of
2435total storage. Let’s assume that we want to keep an extra copy of all tweets for fault tolerance;
2436then, our total storage requirement will be 500TB. If we assume a modern server can store up
2437to 4TB of data, we would need 125 such servers to hold all of the required data for the next five
2438years.
2439Let’s start with a simplistic design where we store the tweets in a MySQL database. We can
2440assume that we store the tweets in a table having two columns, TweetID and TweetText. Let’s
2441assume we partition our data based on TweetID. If our TweetIDs are unique system-wide, we
2442can define a hash function that can map a TweetID to a storage server where we can store that
2443tweet object.
How can we create system-wide unique TweetIDs? If we are getting 400M new tweets each day, then how many tweet objects can we expect in five years?
2446400M * 365 days * 5 years => 730 billion
This means we would need a five-byte number to identify TweetIDs uniquely. Let's assume
2449we have a service that can generate a unique TweetID whenever we need to store an object
2450(The TweetID discussed here will be similar to TweetID discussed in Designing Twitter). We
2451can feed the TweetID to our hash function to find the storage server and store our tweet object
2452there.
24532. Index: What should our index look like? Since our tweet queries will consist of words, let’s
2454build the index that can tell us which word comes in which tweet object. Let’s first estimate
2455how big our index will be. If we want to build an index for all the English words and some
2456famous nouns like people names, city names, etc., and if we assume that we have around
2457300K English words and 200K nouns, then we will have 500k total words in our index. Let’s
2458assume that the average length of a word is five characters. If we are keeping our index in
2459memory, we need 2.5MB of memory to store all the words:
2460500K * 5 => 2.5 MB
2461Let’s assume that we want to keep the index in memory for all the tweets from only past two
2462years. Since we will be getting 730B tweets in 5 years, this will give us 292B tweets in two
2463years. Given that each TweetID will be 5 bytes, how much memory will we need to store all the
2464TweetIDs?
2465292B * 5 => 1460 GB
2466So our index would be like a big distributed hash table, where ‘key’ would be the word and
2467‘value’ will be a list of TweetIDs of all those tweets which contain that word. Assuming on
2468average we have 40 words in each tweet and since we will not be indexing prepositions and
2469other small words like ‘the’, ‘an’, ‘and’ etc., let’s assume we will have around 15 words in each
2470tweet that need to be indexed. This means each TweetID will be stored 15 times in our index.
2471So total memory we will need to store our index:
2472(1460 * 15) + 2.5MB ~= 21 TB
2473Assuming a high-end server has 144GB of memory, we would need 152 such servers to hold
2474our index.
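As a rough illustration of this structure, the sketch below builds a tiny in-memory inverted index mapping each word to the set of TweetIDs that contain it, skipping a (purely illustrative) stop-word list, and answers AND queries by intersecting posting lists:

import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}  # illustrative subset

index = defaultdict(set)   # word -> set of TweetIDs containing that word

def index_tweet(tweet_id: int, text: str) -> None:
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        if word not in STOP_WORDS:
            index[word].add(tweet_id)

def search(query: str) -> set:
    """AND semantics: intersect the posting lists of all query terms."""
    terms = [w for w in re.findall(r"[a-z0-9']+", query.lower()) if w not in STOP_WORDS]
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

index_tweet(1, "Designing a search system for tweets")
index_tweet(2, "Search ranking and relevance")
print(search("search tweets"))   # -> {1}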
2475We can shard our data based on two criteria:
2476Sharding based on Words: While building our index, we will iterate through all the words of a
2477tweet and calculate the hash of each word to find the server where it would be indexed. To
2478find all tweets containing a specific word we have to query only the server which contains this
2479word.
2480We have a couple of issues with this approach:
24811. What if a word becomes hot? Then there will be a lot of queries on the server holding
2482that word. This high load will affect the performance of our service.
24842. Over time, some words can end up storing a lot of TweetIDs compared to others,
2485therefore, maintaining a uniform distribution of words while tweets are growing is quite
2486tricky.
2487To recover from these situations we either have to repartition our data or use Consistent
2488Hashing.
2489Sharding based on the tweet object: While storing, we will pass the TweetID to our hash
2490function to find the server and index all the words of the tweet on that server. While querying
2491for a particular word, we have to query all the servers, and each server will return a set of
2492TweetIDs. A centralized server will aggregate these results to return them to the user.
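A compressed sketch of that scatter-gather path: the aggregator queries every index shard in parallel and unions the returned TweetID sets (the IndexShard class here stands in for an RPC client to a real index server):

from concurrent.futures import ThreadPoolExecutor

class IndexShard:
    """Stand-in for one index server; in practice this would be an RPC client."""
    def __init__(self):
        self.index = {}                      # word -> set of TweetIDs stored on this shard

    def query(self, word: str) -> set:
        return self.index.get(word, set())

def scatter_gather(shards, word: str) -> set:
    """Query all shards in parallel and merge their results at the aggregator."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        results = pool.map(lambda s: s.query(word), shards)
    merged = set()
    for partial in results:
        merged |= partial
    return merged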
2493Detailed component design
24947. Fault Tolerance
What will happen when an index server dies? We can have a secondary replica of each server, and if the primary server dies, the secondary can take over after failover. Both primary and secondary servers will have the same copy of the index.
2498What if both primary and secondary servers die at the same time? We have to allocate a new
2499server and rebuild the same index on it. How can we do that? We don’t know what
2500words/tweets were kept on this server. If we were using ‘Sharding based on the tweet object’,
2501the brute-force solution would be to iterate through the whole database and filter TweetIDs
2502using our hash function to figure out all the required tweets that would be stored on this
server. This would be inefficient; also, while the server was being rebuilt, we would not be able to serve any queries from it, and the user would miss some tweets that they should have seen.
2506How can we efficiently retrieve a mapping between tweets and the index server? We have to
build a reverse index that will map all TweetIDs to their index server. Our Index-Builder
2508server can hold this information. We will need to build a Hashtable where the ‘key’ will be the
2510index server number and the ‘value’ will be a HashSet containing all the TweetIDs being kept
2511at that index server. Notice that we are keeping all the TweetIDs in a HashSet; this will enable
2512us to add/remove tweets from our index quickly. So now, whenever an index server has to
2513rebuild itself, it can simply ask the Index-Builder server for all the tweets it needs to store and
2514then fetch those tweets to build the index. This approach will surely be fast. We should also
2515have a replica of the Index-Builder server for fault tolerance.
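A minimal sketch of that bookkeeping on the Index-Builder, using an in-memory map from index-server number to the set of TweetIDs it holds (names are illustrative; in practice this state would itself be replicated and persisted):

from collections import defaultdict

class IndexBuilder:
    def __init__(self):
        self.server_to_tweets = defaultdict(set)   # index server number -> set of TweetIDs

    def assign(self, server_id: int, tweet_id: int) -> None:
        self.server_to_tweets[server_id].add(tweet_id)

    def remove(self, server_id: int, tweet_id: int) -> None:
        self.server_to_tweets[server_id].discard(tweet_id)

    def tweets_for(self, server_id: int) -> set:
        """A rebuilding index server asks for exactly the TweetIDs it must re-index."""
        return set(self.server_to_tweets[server_id])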
25168. Cache
2517To deal with hot tweets we can introduce a cache in front of our database. We can
2518use Memcached, which can store all such hot tweets in memory. Application servers, before
2519hitting the backend database, can quickly check if the cache has that tweet. Based on clients’
2520usage patterns, we can adjust how many cache servers we need. For cache eviction policy,
2521Least Recently Used (LRU) seems suitable for our system.
25229. Load Balancing
We can add a load balancing layer at two places in our system: 1) between Clients and Application servers and 2) between Application servers and Backend servers. Initially, a simple Round Robin approach that distributes incoming requests equally among backend servers can be adopted. This LB is simple to implement and does not introduce any overhead. Another benefit of this approach is that the LB will take dead servers out of the rotation and stop sending any traffic to them. A problem with Round Robin LB is that it doesn't take server load into consideration; if a server is overloaded or slow, the LB will not stop sending new requests to it. To handle this, a more intelligent LB solution can be used that periodically queries backend servers about their load and adjusts traffic accordingly.
253210. Ranking
2533How about if we want to rank the search results by social graph distance, popularity,
2534relevance, etc?
2535Let’s assume we want to rank tweets by popularity, like how many likes or comments a tweet
2536is getting, etc. In such a case, our ranking algorithm can calculate a ‘popularity number’
2537(based on the number of likes etc.) and store it with the index. Each partition can sort the
2538results based on this popularity number before returning results to the aggregator server. The
2539aggregator server combines all these results, sorts them based on the popularity number, and
2540sends the top results to the user.
2542Designing a Web Crawler
2543Let's design a Web Crawler that will systematically browse and download the World Wide
2544Web. Web crawlers are also known as web spiders, robots, worms, walkers, and bots.
2545Difficulty Level: Hard
25461. What is a Web Crawler?
2547A web crawler is a software program which browses the World Wide Web in a methodical and
2548automated manner. It collects documents by recursively fetching links from a set of starting
pages. Many sites, particularly search engines, use web crawling as a means of providing up-to-date data. Search engines download all the pages to create an index on them to perform
2550faster searches.
2551Some other uses of web crawlers are:
2552• To test web pages and links for valid syntax and structure.
2553• To monitor sites to see when their structure or contents change.
2554• To maintain mirror sites for popular Web sites.
2555• To search for copyright infringements.
2556• To build a special-purpose index, e.g., one that has some understanding of the content
2557stored in multimedia files on the Web.
25582. Requirements and Goals of the System
2559Let’s assume we need to crawl all the web.
2560Scalability: Our service needs to be scalable such that it can crawl the entire Web and can be
2561used to fetch hundreds of millions of Web documents.
2562Extensibility: Our service should be designed in a modular way with the expectation that new
functionality will be added to it. There could be newer document types that need to be
2564downloaded and processed in the future.
25653. Some Design Considerations
2566Crawling the web is a complex task, and there are many ways to go about it. We should be
2567asking a few questions before going any further:
2568Is it a crawler for HTML pages only? Or should we fetch and store other types of media,
2569such as sound files, images, videos, etc.? This is important because the answer can change
2570the design. If we are writing a general-purpose crawler to download different media types, we
2571might want to break down the parsing module into different sets of modules: one for HTML,
2572another for images, or another for videos, where each module extracts what is considered
2573interesting for that media type.
2575Let’s assume for now that our crawler is going to deal with HTML only, but it should be
2576extensible and make it easy to add support for new media types.
2577What protocols are we looking at? HTTP? What about FTP links? What different protocols
2578should our crawler handle? For the sake of the exercise, we will assume HTTP. Again, it
2579shouldn’t be hard to extend the design to use FTP and other protocols later.
2580What is the expected number of pages we will crawl? How big will the URL database
2581become? Let’s assume we need to crawl one billion websites. Since a website can contain
2582many, many URLs, let’s assume an upper bound of 15 billion different web pages that will be
2583reached by our crawler.
2584What is ‘RobotsExclusion’ and how should we deal with it? Courteous Web crawlers
2585implement the Robots Exclusion Protocol, which allows Webmasters to declare parts of their
2586sites off limits to crawlers. The Robots Exclusion Protocol requires a Web crawler to fetch a
2587special document called robot.txt which contains these declarations from a Web site before
2588downloading any real content from it.
25894. Capacity Estimation and Constraints
2590If we want to crawl 15 billion pages within four weeks, how many pages do we need to fetch
2591per second?
259215B / (4 weeks * 7 days * 86400 sec) ~= 6200 pages/sec
What about storage? Page sizes vary a lot but, as mentioned above, since we will be dealing with HTML text only, let's assume an average page size of 100KB. With each page, if we are storing 500 bytes of metadata, the total storage we would need is:
259615B * (100KB + 500) ~= 1.5 petabytes
2597Assuming a 70% capacity model (we don’t want to go above 70% of the total capacity of our
2598storage system), total storage we will need:
25991.5 petabytes / 0.7 ~= 2.14 petabytes
26005. High Level design
2601The basic algorithm executed by any Web crawler is to take a list of seed URLs as its input and
2602repeatedly execute the following steps.
26031. Pick a URL from the unvisited URL list.
26042. Determine the IP Address of its host-name.
26053. Establish a connection to the host to download the corresponding document.
26064. Parse the document contents to look for new URLs.
26075. Add the new URLs to the list of unvisited URLs.
26086. Process the downloaded document, e.g., store it or index its contents, etc.
26107. Go back to step 1
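A deliberately simplified, single-threaded sketch of this loop is shown below; the regex-based link extraction and the store_document placeholder are illustrative only, and a real crawler would add politeness, robots.txt handling, retries, and persistence:

import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)          # 1. the unvisited URL list
    seen = set(seed_urls)
    while frontier and max_pages > 0:
        url = frontier.popleft()                                         # pick a URL
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")  # 2-3. resolve and download
        except Exception:
            continue
        store_document(url, html)                                        # 6. process/store the document
        for link in re.findall(r'href="([^"]+)"', html):                 # 4. crude link extraction
            absolute = urljoin(url, link)
            if absolute not in seen:                                     # 5. add new URLs to the frontier
                seen.add(absolute)
                frontier.append(absolute)
        max_pages -= 1

def store_document(url, html):
    pass  # placeholder: index or persist the page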
2611How to crawl?
Breadth first or depth first? Breadth-first search (BFS) is usually used. However, Depth First Search (DFS) is also utilized in some situations, such as when the crawler has already established a connection with a website; it might then DFS all the URLs within that website to save some handshaking overhead.
2616Path-ascending crawling: Path-ascending crawling can help discover a lot of isolated
2617resources or resources for which no inbound link would have been found in regular crawling
2618of a particular Web site. In this scheme, a crawler would ascend to every path in each URL
2619that it intends to crawl. For example, when given a seed URL
2620of http://foo.com/a/b/page.html, it will attempt to crawl /a/b/, /a/, and /.
2621Difficulties in implementing efficient web crawler
There are two important characteristics of the Web that make Web crawling a very difficult task:
1. Large volume of Web pages: A large volume of web pages implies that the crawler can only download a fraction of them at any time; hence, it is critical that the crawler be intelligent enough to prioritize its downloads.
2. Rate of change of web pages: Another problem with today's dynamic world is that web pages on the internet change very frequently. As a result, by the time the crawler is downloading the last page from a site, the page may change, or a new page may be added to the site.
2631A bare minimum crawler needs at least these components:
26321. URL frontier: To store the list of URLs to download and also prioritize which URLs should
2633be crawled first.
26342. HTTP Fetcher: To retrieve a web page from the server.
26353. Extractor: To extract links from HTML documents.
26364. Duplicate Eliminator: To make sure the same content is not extracted twice
2637unintentionally.
26385. Datastore: To store retrieved pages, URLs, and other metadata.
26406. Detailed Component Design
2641Let’s assume our crawler is running on one server and all the crawling is done by multiple
2642working threads where each working thread performs all the steps needed to download and
2643process a document in a loop.
2644The first step of this loop is to remove an absolute URL from the shared URL frontier for
2645downloading. An absolute URL begins with a scheme (e.g., “HTTP”) which identifies the
2646network protocol that should be used to download it. We can implement these protocols in a
2647modular way for extensibility, so that later if our crawler needs to support more protocols, it
2648can be easily done. Based on the URL’s scheme, the worker calls the appropriate protocol
2649module to download the document. After downloading, the document is placed into a
2650Document Input Stream (DIS). Putting documents into DIS will enable other modules to reread the document multiple times.
2651Once the document has been written to the DIS, the worker thread invokes the dedupe test to
2652determine whether this document (associated with a different URL) has been seen before. If
2653so, the document is not processed any further and the worker thread removes the next URL
2654from the frontier.
2655Next, our crawler needs to process the downloaded document. Each document can have a
2656different MIME type like HTML page, Image, Video, etc. We can implement these MIME
2657schemes in a modular way, so that later if our crawler needs to support more types, we can
2658easily implement them. Based on the downloaded document’s MIME type, the worker invokes
2659the process method of each processing module associated with that MIME type.
2660Furthermore, our HTML processing module will extract all links from the page. Each link is
2661converted into an absolute URL and tested against a user-supplied URL filter to determine if
2662it should be downloaded. If the URL passes the filter, the worker performs the URL-seen test,
2663which checks if the URL has been seen before, namely, if it is in the URL frontier or has
2664already been downloaded. If the URL is new, it is added to the frontier.
2666Let’s discuss these components one by one, and see how they can be distributed onto multiple
2667machines:
26681. The URL frontier: The URL frontier is the data structure that contains all the URLs that
2669remain to be downloaded. We can crawl by performing a breadth-first traversal of the Web,
2670starting from the pages in the seed set. Such traversals are easily implemented by using a
2671FIFO queue.
2672Since we’ll be having a huge list of URLs to crawl, we can distribute our URL frontier into
2673multiple servers. Let’s assume on each server we have multiple worker threads performing the
2674crawling tasks. Let’s also assume that our hash function maps each URL to a server which will
2675be responsible for crawling it.
2676Following politeness requirements must be kept in mind while designing a distributed URL
2677frontier:
26781. Our crawler should not overload a server by downloading a lot of pages from it.
2. We should not have multiple machines connecting to a web server.
2680To implement this politeness constraint our crawler can have a collection of distinct FIFO
2681sub-queues on each server. Each worker thread will have its separate sub-queue, from which
2682it removes URLs for crawling. When a new URL needs to be added, the FIFO sub-queue in
2683which it is placed will be determined by the URL’s canonical hostname. Our hash function can
2684map each hostname to a thread number. Together, these two points imply that, at most, one
2686worker thread will download documents from a given Web server and also, by using FIFO
2687queue, it’ll not overload a Web server.
2688How big will our URL frontier be? The size would be in the hundreds of millions of URLs.
2689Hence, we need to store our URLs on a disk. We can implement our queues in such a way that
2690they have separate buffers for enqueuing and dequeuing. Enqueue buffer, once filled, will be
2691dumped to the disk, whereas dequeue buffer will keep a cache of URLs that need to be visited;
2692it can periodically read from disk to fill the buffer.
26932. The fetcher module: The purpose of a fetcher module is to download the document
2694corresponding to a given URL using the appropriate network protocol like HTTP. As
discussed above, webmasters create robots.txt to make certain parts of their websites off limits for the crawler. To avoid downloading this file on every request, our crawler's HTTP protocol module can maintain a fixed-size cache mapping hostnames to their robots exclusion rules.
26983. Document input stream: Our crawler’s design enables the same document to be processed
2699by multiple processing modules. To avoid downloading a document multiple times, we cache
2700the document locally using an abstraction called a Document Input Stream (DIS).
2701A DIS is an input stream that caches the entire contents of the document read from the
2702internet. It also provides methods to re-read the document. The DIS can cache small
2703documents (64 KB or less) entirely in memory, while larger documents can be temporarily
2704written to a backing file.
2705Each worker thread has an associated DIS, which it reuses from document to document. After
2706extracting a URL from the frontier, the worker passes that URL to the relevant protocol
2707module, which initializes the DIS from a network connection to contain the document’s
2708contents. The worker then passes the DIS to all relevant processing modules.
27094. Document Dedupe test: Many documents on the Web are available under multiple,
2710different URLs. There are also many cases in which documents are mirrored on various
2711servers. Both of these effects will cause any Web crawler to download the same document
2712multiple times. To prevent processing of a document more than once, we perform a dedupe
2713test on each document to remove duplication.
2714To perform this test, we can calculate a 64-bit checksum of every processed document and
2715store it in a database. For every new document, we can compare its checksum to all the
previously calculated checksums to see if the document has been seen before. We can use MD5
2717or SHA to calculate these checksums.
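A minimal sketch of this dedupe test, using an MD5 digest truncated to 64 bits; the in-memory set stands in for the database-backed checksum store described above:

import hashlib

seen_checksums = set()   # in production this would be a database, fronted by an LRU cache

def checksum64(document: str) -> int:
    return int(hashlib.md5(document.encode()).hexdigest()[:16], 16)  # 64-bit checksum

def is_duplicate(document: str) -> bool:
    c = checksum64(document)
    if c in seen_checksums:
        return True
    seen_checksums.add(c)
    return False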
How big would the checksum store be? If the whole purpose of our checksum store is to do dedupe, then we just need to keep a unique set containing the checksums of all previously processed documents. Considering 15 billion distinct web pages, we would need:
272115B * 8 bytes => 120 GB
2722Although this can fit into a modern-day server’s memory, if we don’t have enough memory
available, we can keep a smaller LRU-based cache on each server, with everything backed by
2725persistent storage. The dedupe test first checks if the checksum is present in the cache. If not,
it has to check if the checksum resides in the backing storage. If the checksum is found, we will ignore the document. Otherwise, it will be added to the cache and the backing storage.
27285. URL filters: The URL filtering mechanism provides a customizable way to control the set of
2729URLs that are downloaded. This is used to blacklist websites so that our crawler can ignore
2730them. Before adding each URL to the frontier, the worker thread consults the user-supplied
2731URL filter. We can define filters to restrict URLs by domain, prefix, or protocol type.
27326. Domain name resolution: Before contacting a Web server, a Web crawler must use the
Domain Name System (DNS) to map the Web server's hostname to an IP address. DNS name resolution will be a big bottleneck for our crawler given the number of URLs we will be working with. To avoid repeated requests, we can cache DNS results by building our own local DNS server.
27377. URL dedupe test: While extracting links, any Web crawler will encounter multiple links to
2738the same document. To avoid downloading and processing a document multiple times, a URL
2739dedupe test must be performed on each extracted link before adding it to the URL frontier.
2740To perform the URL dedupe test, we can store all the URLs seen by our crawler in canonical
2741form in a database. To save space, we do not store the textual representation of each URL in
2742the URL set, but rather a fixed-sized checksum.
2743To reduce the number of operations on the database store, we can keep an in-memory cache
2744of popular URLs on each host shared by all threads. The reason to have this cache is that links
2745to some URLs are quite common, so caching the popular ones in memory will lead to a high
2746in-memory hit rate.
How much storage would we need for the URL store? If the whole purpose of our checksum is to do URL dedupe, then we just need to keep a unique set containing the checksums of all previously seen URLs. Considering 15 billion distinct URLs and 4 bytes per checksum, we would need:
275115B * 4 bytes => 60 GB
2752Can we use bloom filters for deduping? Bloom filters are a probabilistic data structure for set
2753membership testing that may yield false positives. A large bit vector represents the set. An
2754element is added to the set by computing ‘n’ hash functions of the element and setting the
2755corresponding bits. An element is deemed to be in the set if the bits at all ‘n’ of the element’s
hash locations are set. Hence, an element may incorrectly be deemed to be in the set, but
2757false negatives are not possible.
2758The disadvantage of using a bloom filter for the URL seen test is that each false positive will
2759cause the URL not to be added to the frontier and, therefore, the document will never be
2760downloaded. The chance of a false positive can be reduced by making the bit vector larger.
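For illustration, a toy bloom filter along these lines, deriving its hash functions from salted MD5 digests; the bit-vector size and hash count here are arbitrary and nowhere near what 15 billion URLs would require:

import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        for i in range(self.num_hashes):                 # n salted hashes of the element
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        """True may be a false positive; False is always correct (no false negatives)."""
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

urls = BloomFilter()
urls.add("http://example.com/a")
print(urls.might_contain("http://example.com/a"))   # True
print(urls.might_contain("http://example.com/b"))   # almost certainly False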
27628. Checkpointing: A crawl of the entire Web takes weeks to complete. To guard against
2763failures, our crawler can write regular snapshots of its state to the disk. An interrupted
2764or aborted crawl can easily be restarted from the latest checkpoint.
27657. Fault tolerance
2766We should use consistent hashing for distribution among crawling servers. Consistent hashing
2767will not only help in replacing a dead host, but also help in distributing load among crawling
2768servers.
2769All our crawling servers will be performing regular checkpointing and storing their FIFO
2770queues to disks. If a server goes down, we can replace it. Meanwhile, consistent hashing
2771should shift the load to other servers.
27728. Data Partitioning
2773Our crawler will be dealing with three kinds of data: 1) URLs to visit 2) URL checksums for
2774dedupe 3) Document checksums for dedupe.
2775Since we are distributing URLs based on the hostnames, we can store these data on the same
2776host. So, each host will store its set of URLs that need to be visited, checksums of all the
2777previously visited URLs and checksums of all the downloaded documents. Since we will be
2778using consistent hashing, we can assume that URLs will be redistributed from overloaded
2779hosts.
2780Each host will perform checkpointing periodically and dump a snapshot of all the data it is
holding onto a remote server. This will ensure that if a server dies, another server can replace it by restoring its data from the last snapshot.
27839. Crawler Traps
2784There are many crawler traps, spam sites, and cloaked content. A crawler trap is a URL or set
2785of URLs that cause a crawler to crawl indefinitely. Some crawler traps are unintentional. For
2786example, a symbolic link within a file system can create a cycle. Other crawler traps are
2787introduced intentionally. For example, people have written traps that dynamically generate an
2788infinite Web of documents. The motivations behind such traps vary. Anti-spam traps are
2789designed to catch crawlers used by spammers looking for email addresses, while other sites
2790use traps to catch search engine crawlers to boost their search ratings.
2792Designing Facebook’s Newsfeed
2793Let's design Facebook's Newsfeed, which would contain posts, photos, videos, and status
2794updates from all the people and pages a user follows. Similar Services: Twitter Newsfeed,
2795Instagram Newsfeed, Quora Newsfeed Difficulty Level: Hard
27961. What is Facebook’s newsfeed?
2797A Newsfeed is the constantly updating list of stories in the middle of Facebook’s homepage.
2798It includes status updates, photos, videos, links, app activity, and ‘likes’ from people, pages,
and groups that a user follows on Facebook. In other words, it is a complete, scrollable compilation of your friends' (and your own) life story built from photos, videos, locations, status updates, and other activities.
2802For any social media site you design - Twitter, Instagram, or Facebook - you will need some
2803newsfeed system to display updates from friends and followers.
28042. Requirements and Goals of the System
2805Let’s design a newsfeed for Facebook with the following requirements:
2806Functional requirements:
28071. Newsfeed will be generated based on the posts from the people, pages, and groups that
2808a user follows.
28092. A user may have many friends and follow a large number of pages/groups.
28103. Feeds may contain images, videos, or just text.
28114. Our service should support appending new posts as they arrive to the newsfeed for all
2812active users.
2813Non-functional requirements:
28141. Our system should be able to generate any user’s newsfeed in real-time - maximum
2815latency seen by the end user would be 2s.
28162. A post shouldn’t take more than 5s to make it to a user’s feed assuming a new newsfeed
2817request comes in.
28183. Capacity Estimation and Constraints
2819Let’s assume on average a user has 300 friends and follows 200 pages.
2820Traffic estimates: Let’s assume 300M daily active users with each user fetching their timeline
2821an average of five times a day. This will result in 1.5B newsfeed requests per day or
2822approximately 17,500 requests per second.
2824Storage estimates: On average, let’s assume we need to have around 500 posts in every user’s
2825feed that we want to keep in memory for a quick fetch. Let’s also assume that on average each
2826post would be 1KB in size. This would mean that we need to store roughly 500KB of data per
2827user. To store all this data for all the active users we would need 150TB of memory. If a server
2828can hold 100GB we would need around 1500 machines to keep the top 500 posts in memory
2829for all active users.
28304. System APIs
Note: Once we have finalized the requirements, it's always a good idea to define the system APIs. This should explicitly state what is expected from the system.
2834We can have SOAP or REST APIs to expose the functionality of our service. The following
2835could be the definition of the API for getting the newsfeed:
2836getUserFeed(api_dev_key, user_id, since_id, count, max_id, exclude_replies)
2837Parameters:
api_dev_key (string): The API developer key of a registered account. This will be used to, among other things, throttle users based on their allocated quota.
2840user_id (number): The ID of the user for whom the system will generate the newsfeed.
2841since_id (number): Optional; returns results with an ID higher than (that is, more recent
2842than) the specified ID.
2843count (number): Optional; specifies the number of feed items to try and retrieve up to a
2844maximum of 200 per distinct request.
2845max_id (number): Optional; returns results with an ID less than (that is, older than) or equal
2846to the specified ID.
2847exclude_replies(boolean): Optional; this parameter will prevent replies from appearing in
2848the returned timeline.
2849Returns: (JSON) Returns a JSON object containing a list of feed items.
28505. Database Design
2851There are three primary objects: User, Entity (e.g. page, group, etc.), and FeedItem (or Post).
2852Here are some observations about the relationships between these entities:
2853• A User can follow other entities and can become friends with other users.
2854• Both users and entities can post FeedItems which can contain text, images, or videos.
2855• Each FeedItem will have a UserID which will point to the User who created it. For
2856simplicity, let’s assume that only users can create feed items, although, on Facebook
2857Pages can post feed item too.
2858• Each FeedItem can optionally have an EntityID pointing to the page or the group where
2859that post was created.
2861If we are using a relational database, we would need to model two relations: User-Entity
2862relation and FeedItem-Media relation. Since each user can be friends with many people and
2863follow a lot of entities, we can store this relation in a separate table. The “Type” column in
2864“UserFollow” identifies if the entity being followed is a User or Entity. Similarly, we can have a
2865table for FeedMedia relation.
28666. High Level System Design
2867At a high level this problem can be divided into two parts:
2868Feed generation: Newsfeed is generated from the posts (or feed items) of users and entities
2869(pages and groups) that a user follows. So, whenever our system receives a request to generate
2870the feed for a user (say Jane), we will perform the following steps:
28711. Retrieve IDs of all users and entities that Jane follows.
28722. Retrieve latest, most popular and relevant posts for those IDs. These are the potential
2873posts that we can show in Jane’s newsfeed.
28743. Rank these posts based on the relevance to Jane. This represents Jane’s current feed.
28754. Store this feed in the cache and return top posts (say 20) to be rendered on Jane’s feed.
28765. On the front-end, when Jane reaches the end of her current feed, she can fetch the next
287720 posts from the server and so on.
2879One thing to notice here is that we generated the feed once and stored it in the cache. What
2880about new incoming posts from people that Jane follows? If Jane is online, we should have a
2881mechanism to rank and add those new posts to her feed. We can periodically (say every five
2882minutes) perform the above steps to rank and add the newer posts to her feed. Jane can then
2883be notified that there are newer items in her feed that she can fetch.
2884Feed publishing: Whenever Jane loads her newsfeed page, she has to request and pull feed
2885items from the server. When she reaches the end of her current feed, she can pull more data
2886from the server. For newer items either the server can notify Jane and then she can pull, or
2887the server can push, these new posts. We will discuss these options in detail later.
2888At a high level, we will need following components in our Newsfeed service:
28891. Web servers: To maintain a connection with the user. This connection will be used to
2890transfer data between the user and the server.
28912. Application server: To execute the workflows of storing new posts in the database
2892servers. We will also need some application servers to retrieve and to push the newsfeed
2893to the end user.
28943. Metadata database and cache: To store the metadata about Users, Pages, and Groups.
28954. Posts database and cache: To store metadata about posts and their contents.
28965. Video and photo storage, and cache: Blob storage, to store all the media included in
2897the posts.
28986. Newsfeed generation service: To gather and rank all the relevant posts for a user to
2899generate newsfeed and store in the cache. This service will also receive live updates and
2900will add these newer feed items to any user’s timeline.
29017. Feed notification service: To notify the user that there are newer items available for
2902their newsfeed.
2903Following is the high-level architecture diagram of our system. User B and C are following
2904User A.
2906Facebook Newsfeed Architecture
29077. Detailed Component Design
2908Let’s discuss different components of our system in detail.
2909a. Feed generation
2910Let’s take the simple case of the newsfeed generation service fetching most recent posts from
2911all the users and entities that Jane follows; the query would look like this:
(SELECT FeedItemID FROM FeedItem WHERE UserID IN (
    SELECT EntityOrFriendID FROM UserFollow WHERE UserID = <current_user_id> AND Type = 0 -- 0 = user
))
UNION
(SELECT FeedItemID FROM FeedItem WHERE EntityID IN (
    SELECT EntityOrFriendID FROM UserFollow WHERE UserID = <current_user_id> AND Type = 1 -- 1 = entity
))
ORDER BY CreationDate DESC
LIMIT 100
2923Here are issues with this design for the feed generation service:
1. Extremely slow for users with a lot of friends/follows, as we have to perform sorting/merging/ranking of a huge number of posts.
2. We generate the timeline when a user loads their page. This would be quite slow and have high latency.
3. For live updates, each status update will result in feed updates for all followers. This could result in high backlogs in our Newsfeed Generation Service.
4. For live updates, the server pushing (or notifying about) newer posts to users could lead to very heavy loads, especially for people or pages that have a lot of followers. To improve efficiency, we can pre-generate the timeline and store it in memory.
2934Offline generation for newsfeed: We can have dedicated servers that are continuously
generating users' newsfeeds and storing them in memory. So, whenever a user requests new posts for their feed, we can simply serve them from the pre-generated, stored location. Using this scheme, a user's newsfeed is not compiled on load, but rather on a regular basis, and is returned to the user whenever they request it.
Whenever these servers need to generate the feed for a user, they will first check the last time the feed was generated for that user. Then, new feed data would be
2941generated from that time onwards. We can store this data in a hash table where the “key”
2942would be UserID and “value” would be a STRUCT like this:
2943Struct {
2944 LinkedHashMap<FeedItemID, FeedItem> feedItems;
2945 DateTime lastGenerated;
2946}
2947We can store FeedItemIDs in a data structure similar to Linked HashMap or TreeMap, which
2948can allow us to not only jump to any feed item but also iterate through the map easily.
Whenever users want to fetch more feed items, they can send the last FeedItemID they currently see in their newsfeed; we can then jump to that FeedItemID in our hash-map and return the next batch/page of feed items from there.
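A rough sketch of that pagination, using Python's OrderedDict in place of the LinkedHashMap; the structure and field names are illustrative, and a production version would avoid the linear scan for the last-seen ID:

from collections import OrderedDict
from datetime import datetime

class UserFeed:
    def __init__(self):
        self.feed_items = OrderedDict()       # FeedItemID -> FeedItem, newest first
        self.last_generated = datetime.min

feeds = {}                                    # UserID -> UserFeed (the in-memory hash table)

def fetch_next_page(user_id, last_seen_item_id=None, page_size=20):
    feed = feeds[user_id]
    ids = list(feed.feed_items.keys())
    # jump just past the last FeedItemID the client has already seen
    start = ids.index(last_seen_item_id) + 1 if last_seen_item_id in feed.feed_items else 0
    return [feed.feed_items[i] for i in ids[start:start + page_size]]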
2952How many feed items should we store in memory for a user’s feed? Initially, we can decide
2953to store 500 feed items per user, but this number can be adjusted later based on the usage
2954pattern. For example, if we assume that one page of a user’s feed has 20 posts and most of the
2955users never browse more than ten pages of their feed, we can decide to store only 200 posts
2956per user. For any user who wants to see more posts (more than what is stored in memory), we
2957can always query backend servers.
Should we generate (and keep in memory) newsfeeds for all users? There will be a lot of users that don't log in frequently. Here are a few things we can do to handle this: 1) a more straightforward approach could be to use an LRU-based cache that removes from memory users who haven't accessed their newsfeed for a long time; 2) a smarter solution can figure out users' login patterns to pre-generate their newsfeeds, e.g., at what time of the day a user is active and on which days of the week they access their newsfeed, etc.
2964Let’s now discuss some solutions to our “live updates” problems in the following section.
2966b. Feed publishing
2967The process of pushing a post to all the followers is called a fanout. By analogy, the push
2968approach is called fanout-on-write, while the pull approach is called fanout-on-load. Let’s
2969discuss different options for publishing feed data to users.
29701. “Pull” model or Fan-out-on-load: This method involves keeping all the recent feed data
2971in memory so that users can pull it from the server whenever they need it. Clients can
2972pull the feed data on a regular basis or manually whenever they need it. Possible
2973problems with this approach are a) New data might not be shown to the users until they
2974issue a pull request, b) It’s hard to find the right pull cadence, as most of the time pull
2975requests will result in an empty response if there is no new data, causing waste of
2976resources.
29772. “Push” model or Fan-out-on-write: For a push system, once a user has published a
2978post, we can immediately push this post to all the followers. The advantage is that when
2979fetching feed you don’t need to go through your friend’s list and get feeds for each of
2980them. It significantly reduces read operations. To efficiently handle this, users have to
2981maintain a Long Poll request with the server for receiving the updates. A possible
problem with this approach is that when a user has millions of followers (a celebrity user), the server has to push updates to a lot of people.
29833. Hybrid: An alternate method to handle feed data could be to use a hybrid approach, i.e.,
2984to do a combination of fan-out-on-write and fan-out-on-load. Specifically, we can stop
2985pushing posts from users with a high number of followers (a celebrity user) and only
2986push data for those users who have a few hundred (or thousand) followers. For celebrity
2987users, we can let the followers pull the updates. Since the push operation can be
2988extremely costly for users who have a lot of friends or followers, by disabling fanout for
2989them, we can save a huge number of resources. Another alternate approach could be
2990that, once a user publishes a post, we can limit the fanout to only her online friends.
2991Also, to get benefits from both the approaches, a combination of ‘push to notify’ and
2992‘pull for serving’ end users is a great way to go. Purely a push or pull model is less
2993versatile.
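A compressed sketch of the hybrid decision: fan out on write for ordinary users, but skip fanout for celebrity users so their followers pull instead; the follower threshold and the in-memory stores are illustrative assumptions:

CELEBRITY_THRESHOLD = 10_000                  # illustrative cutoff, not a prescribed value

followers = {}                                # user_id -> list of follower ids
feeds = {}                                    # user_id -> list of post ids (pre-generated feeds)

def publish_post(author_id, post_id):
    fans = followers.get(author_id, [])
    if len(fans) > CELEBRITY_THRESHOLD:
        return                                # fanout-on-load: followers will pull this post
    for follower in fans:                     # fanout-on-write for ordinary users
        feeds.setdefault(follower, []).insert(0, post_id)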
2994How many feed items can we return to the client in each request? We should have a
2995maximum limit for the number of items a user can fetch in one request (say 20). But, we
2996should let the client specify how many feed items they want with each request as the user may
2997like to fetch a different number of posts depending on the device (mobile vs. desktop).
2998Should we always notify users if there are new posts available for their newsfeed? It could
2999be useful for users to get notified whenever new data is available. However, on mobile devices,
3000where data usage is relatively expensive, it can consume unnecessary bandwidth. Hence, at
3001least for mobile devices, we can choose not to push data, instead, let users “Pull to Refresh” to
3002get new posts.
30048. Feed Ranking
3005The most straightforward way to rank posts in a newsfeed is by the creation time of the posts,
3006but today’s ranking algorithms are doing a lot more than that to ensure “important” posts are
3007ranked higher. The high-level idea of ranking is first to select key “signals” that make a post
3008important and then to find out how to combine them to calculate a final ranking score.
3009More specifically, we can select features that are relevant to the importance of any feed item,
3010e.g., number of likes, comments, shares, time of the update, whether the post has
3011images/videos, etc., and then, a score can be calculated using these features. This is generally
3012enough for a simple ranking system. A better ranking system can significantly improve itself
3013by constantly evaluating if we are making progress in user stickiness, retention, ads revenue,
3014etc.
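For illustration only, a naive linear scoring function over a few such signals; the weights and the time-decay term are arbitrary placeholders rather than a tuned model:

import math
import time

def popularity_score(post) -> float:
    age_hours = (time.time() - post["created_at"]) / 3600
    return (
        1.0 * post["likes"]
        + 2.0 * post["comments"]
        + 3.0 * post["shares"]
        + (5.0 if post["has_media"] else 0.0)
    ) / math.log2(age_hours + 2)              # decay the score of older posts

posts = [
    {"created_at": time.time() - 7200, "likes": 120, "comments": 10, "shares": 4, "has_media": True},
    {"created_at": time.time() - 600, "likes": 5, "comments": 1, "shares": 0, "has_media": False},
]
ranked = sorted(posts, key=popularity_score, reverse=True)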
30159. Data Partitioning
3016a. Sharding posts and metadata
3017Since we have a huge number of new posts every day and our read load is extremely high too,
3018we need to distribute our data onto multiple machines such that we can read/write it
3019efficiently. For sharding our databases that are storing posts and their metadata, we can have
3020a similar design as discussed under Designing Twitter.
3021b. Sharding feed data
3022For feed data, which is being stored in memory, we can partition it based on UserID. We can
3023try storing all the data of a user on one server. When storing, we can pass the UserID to our
3024hash function that will map the user to a cache server where we will store the user’s feed
objects. Also, for any given user, since we don't expect to store more than 500 FeedItemIDs,
3026we will not run into a scenario where feed data for a user doesn’t fit on a single server. To get
3027the feed of a user, we would always have to query only one server. For future growth and
3028replication, we must use Consistent Hashing.
3030Designing Yelp or Nearby Friends
3031Let's design a Yelp like service, where users can search for nearby places like restaurants,
3032theaters, or shopping malls, etc., and can also add/view reviews of places. Similar Services:
3033Proximity server. Difficulty Level: Hard
30341. Why Yelp or Proximity Server?
3035Proximity servers are used to discover nearby attractions like places, events, etc. If you haven’t
3036used yelp.com before, please try it before proceeding (you can search for nearby restaurants,
3037theaters, etc.) and spend some time understanding different options that the website offers.
3038This will help you a lot in understanding this chapter better.
30392. Requirements and Goals of the System
3040What do we wish to achieve from a Yelp like service? Our service will be storing information
3041about different places so that users can perform a search on them. Upon querying, our service
3042will return a list of places around the user.
3043Our Yelp-like service should meet the following requirements:
3044Functional Requirements:
30451. Users should be able to add/delete/update Places.
30462. Given their location (longitude/latitude), users should be able to find all nearby places
3047within a given radius.
30483. Users should be able to add feedback/review about a place. The feedback can have
3049pictures, text, and a rating.
3050Non-functional Requirements:
30511. Users should have a real-time search experience with minimum latency.
30522. Our service should support a heavy search load. There will be a lot of search requests
3053compared to adding a new place.
30543. Scale Estimation
3055Let’s build our system assuming that we have 500M places and 100K queries per second
3056(QPS). Let’s also assume a 20% growth in the number of places and QPS each year.
30574. Database Schema
3058Each Place can have the following fields:
30591. LocationID (8 bytes): Uniquely identifies a location.
30602. Name (256 bytes)
30623. Latitude (8 bytes)
30634. Longitude (8 bytes)
30645. Description (512 bytes)
30656. Category (1 byte): E.g., coffee shop, restaurant, theater, etc.
Although a four-byte number can uniquely identify 500M locations, with future growth in
3067mind, we will go with 8 bytes for LocationID.
3068Total size: 8 + 256 + 8 + 8 + 512 + 1 => 793 bytes
3069We also need to store reviews, photos, and ratings of a Place. We can have a separate table to
3070store reviews for Places:
30711. LocationID (8 bytes)
30722. ReviewID (4 bytes): Uniquely identifies a review, assuming any location will not have
3073more than 2^32 reviews.
30743. ReviewText (512 bytes)
30754. Rating (1 byte): how many stars a place gets out of ten.
3076Similarly, we can have a separate table to store photos for Places and Reviews.
30775. System APIs
3078We can have SOAP or REST APIs to expose the functionality of our service. The following
3079could be the definition of the API for searching:
3080search(api_dev_key, search_terms, user_location, radius_filter, maximum_results_to_return,
3081 category_filter, sort, page_token)
3082Parameters:
3083api_dev_key (string): The API developer key of a registered account. This will be used to,
3084among other things, throttle users based on their allocated quota.
3085search_terms (string): A string containing the search terms.
3086user_location (string): Location of the user performing the search.
3087radius_filter (number): Optional search radius in meters.
3088maximum_results_to_return (number): Number of business results to return.
3089category_filter (string): Optional category to filter search results, e.g., Restaurants, Shopping
3090Centers, etc.
3091sort (number): Optional sort mode: Best matched (0 - default), Minimum distance (1),
3092Highest rated (2).
3093page_token (string): This token will specify a page in the result set that should be returned.
3094Returns: (JSON)
3095A JSON containing information about a list of businesses matching the search query. Each
3096result entry will have the business name, address, category, rating, and thumbnail.
30986. Basic System Design and Algorithm
3099At a high level, we need to store and index each dataset described above (places, reviews, etc.).
3100For users to query this massive database, the indexing should be read efficient, since while
3101searching for the nearby places users expect to see the results in real-time.
3102Given that the location of a place doesn’t change that often, we don’t need to worry about
3103frequent updates of the data. As a contrast, if we intend to build a service where objects do
3104change their location frequently, e.g., people or taxis, then we might come up with a very
3105different design.
3106Let’s see what are different ways to store this data and also find out which method will suit
3107best for our use cases:
3108a. SQL solution
3109One simple solution could be to store all the data in a database like MySQL. Each place will be
3110stored in a separate row, uniquely identified by LocationID. Each place will have its longitude
and latitude stored separately in two different columns; to perform a fast search, we should have indexes on both of these fields.
3113To find all the nearby places of a given location (X, Y) within a radius ‘D’, we can query like
3114this:
3115Select * from Places where Latitude between X-D and X+D and Longitude between Y-D and
3116Y+D
3117The above query is not completely accurate, as we know that to find the distance between two
points we have to use the distance formula (Pythagorean theorem), but for simplicity let's go with this.
3120How efficient would this query be? We have estimated 500M places to be stored in our
3121service. Since we have two separate indexes, each index can return a huge list of places and
3122performing an intersection on those two lists won’t be efficient. Another way to look at this
3123problem is that there could be too many locations between ‘X-D’ and ‘X+D’, and similarly
3124between ‘Y-D’ and ‘Y+D’. If we can somehow shorten these lists, it can improve the
3125performance of our query.
3126b. Grids
3127We can divide the whole map into smaller grids to group locations into smaller sets. Each grid
3128will store all the Places residing within a specific range of longitude and latitude. This scheme
3129would enable us to query only a few grids to find nearby places. Based on a given location and
3130radius, we can find all the neighboring grids and then query these grids to find nearby places.
3132Let’s assume that GridID (a four bytes number) would uniquely identify grids in our system.
3133What could be a reasonable grid size? Grid size could be equal to the distance we would like
3134to query since we also want to reduce the number of grids. If the grid size is equal to the
3135distance we want to query, then we only need to search within the grid which contains the
3136given location and neighboring eight grids. Since our grids would be statically defined (from
3137the fixed grid size), we can easily find the grid number of any location (lat, long) and its
3138neighboring grids.
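A sketch of that computation for statically defined grids, assuming a cell size of roughly 10 miles of latitude expressed in degrees; this ignores the fact that a degree of longitude shrinks toward the poles, which a real system would account for:

GRID_SIZE_DEG = 0.15   # ~10 miles of latitude; illustrative fixed cell size

def grid_id(lat: float, lng: float) -> int:
    """Map (lat, lng) to a single integer GridID for a statically defined grid."""
    row = int((lat + 90) / GRID_SIZE_DEG)
    col = int((lng + 180) / GRID_SIZE_DEG)
    cols = int(360 / GRID_SIZE_DEG)
    return row * cols + col

def neighboring_grids(lat: float, lng: float):
    """The grid containing the location plus its eight neighbors."""
    ids = []
    for d_lat in (-GRID_SIZE_DEG, 0, GRID_SIZE_DEG):
        for d_lng in (-GRID_SIZE_DEG, 0, GRID_SIZE_DEG):
            ids.append(grid_id(lat + d_lat, lng + d_lng))
    return ids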
3139In the database, we can store the GridID with each location and have an index on it, too, for
3140faster searching. Now, our query will look like:
3141Select * from Places where Latitude between X-D and X+D and Longitude between Y-D and
3142Y+D and GridID in (GridID, GridID1, GridID2, ..., GridID8)
3143This will undoubtedly improve the runtime of our query.
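To make the grid scheme concrete, here is a minimal Java sketch (not part of the original design, and with an assumed grid size and illustrative class/method names) of how a fixed-size grid could map a latitude/longitude pair to a GridID and enumerate the nine grids we need to query:

public class GridIndex {
    // Roughly 10 miles of latitude per cell; an assumed value for this sketch.
    static final double GRID_SIZE_DEG = 0.15;
    static final int COLS = (int) Math.ceil(360.0 / GRID_SIZE_DEG);

    // Collapse the 2-D cell coordinates of (lat, lng) into a single integer GridID.
    static int gridId(double lat, double lng) {
        int row = (int) ((lat + 90.0) / GRID_SIZE_DEG);
        int col = (int) ((lng + 180.0) / GRID_SIZE_DEG);
        return row * COLS + col;
    }

    // The grid containing (lat, lng) plus its eight neighbors; wrap-around at the
    // poles and the antimeridian is ignored to keep the sketch short.
    static int[] neighborhood(double lat, double lng) {
        int[] ids = new int[9];
        int i = 0;
        for (int dr = -1; dr <= 1; dr++)
            for (int dc = -1; dc <= 1; dc++)
                ids[i++] = gridId(lat + dr * GRID_SIZE_DEG, lng + dc * GRID_SIZE_DEG);
        return ids;
    }
}

The IDs returned by neighborhood() are exactly what would be plugged into the GridID in (...) clause of the query above.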
3144Should we keep our index in memory? Maintaining the index in memory will improve the
3145performance of our service. We can keep our index in a hash table where ‘key’ is the grid
3146number and ‘value’ is the list of places contained in that grid.
How much memory will we need to store the index? Let's assume our search radius (and hence our grid size) is 10 miles; given that the total area of the earth is around 200 million square miles, we will have roughly 2 million 100-square-mile grids. We would need a four-byte number to uniquely identify each grid and, since LocationID is 8 bytes, we would need about 4GB of memory (ignoring hash table overhead) to store the index.
(4 * 2M) + (8 * 500M) ~= 4 GB
This solution can still be slow for grids that contain a lot of places, since places are not uniformly distributed among grids. We can have densely packed areas with a lot of places and, on the other hand, areas that are sparsely populated.
3157This problem can be solved if we can dynamically adjust our grid size such that whenever we
3158have a grid with a lot of places we break it down to create smaller grids. A couple of challenges
3159with this approach could be: 1) how to map these grids to locations and 2) how to find all the
3160neighboring grids of a grid.
3161c. Dynamic size grids
Let's assume we don't want to have more than 500 places in a grid, so that searching stays fast. So, whenever a grid reaches this limit, we break it down into four grids of equal size and distribute the places among them. This means thickly populated areas like downtown San Francisco will have a lot of grids, while sparsely populated areas like the Pacific Ocean will have large grids with places only around the coastlines.
3167What data-structure can hold this information? A tree in which each node has four children
3168can serve our purpose. Each node will represent a grid and will contain information about all
3169the places in that grid. If a node reaches our limit of 500 places, we will break it down to
3170create four child nodes under it and distribute places among them. In this way, all the leaf
3171nodes will represent the grids that cannot be further broken down. So leaf nodes will keep a
list of places with them. This tree structure, in which each node can have four children, is called a QuadTree.
3174How will we build a QuadTree? We will start with one node that will represent the whole
3175world in one grid. Since it will have more than 500 locations, we will break it down into four
3176nodes and distribute locations among them. We will keep repeating this process with each
3177child node until there are no nodes left with more than 500 locations.
3179How will we find the grid for a given location? We will start with the root node and search
3180downward to find our required node/grid. At each step, we will see if the current node we are
3181visiting has children. If it has, we will move to the child node that contains our desired
3182location and repeat this process. If the node does not have any children, then that is our
3183desired node.
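The description above translates fairly directly into code. Below is a minimal, single-server Java sketch of such a QuadTree, assuming illustrative class and field names: leaves hold at most 500 places, a full leaf splits into four equal children, and findLeaf() walks from the root to the grid containing a location. Boundary cases (points exactly on a split line, many places at the same coordinates) are simplified.

import java.util.ArrayList;
import java.util.List;

public class QuadTree {
    static final int MAX_PLACES = 500;

    static class Place { long locationId; double lat; double lng; }

    double minLat, minLng, maxLat, maxLng;      // bounding box of this grid
    List<Place> places = new ArrayList<>();     // only meaningful on leaf nodes
    QuadTree[] children;                        // null for leaf nodes

    QuadTree(double minLat, double minLng, double maxLat, double maxLng) {
        this.minLat = minLat; this.minLng = minLng;
        this.maxLat = maxLat; this.maxLng = maxLng;
    }

    void insert(Place p) {
        if (children != null) { childFor(p.lat, p.lng).insert(p); return; }
        places.add(p);
        if (places.size() > MAX_PLACES) split();
    }

    // Walk down from the root to the leaf grid containing (lat, lng).
    QuadTree findLeaf(double lat, double lng) {
        QuadTree node = this;
        while (node.children != null) node = node.childFor(lat, lng);
        return node;
    }

    private void split() {
        double midLat = (minLat + maxLat) / 2, midLng = (minLng + maxLng) / 2;
        children = new QuadTree[] {
            new QuadTree(minLat, minLng, midLat, midLng),
            new QuadTree(minLat, midLng, midLat, maxLng),
            new QuadTree(midLat, minLng, maxLat, midLng),
            new QuadTree(midLat, midLng, maxLat, maxLng)
        };
        for (Place p : places) childFor(p.lat, p.lng).insert(p);
        places.clear();
    }

    private QuadTree childFor(double lat, double lng) {
        double midLat = (minLat + maxLat) / 2, midLng = (minLng + maxLng) / 2;
        int idx = (lat < midLat ? 0 : 2) + (lng < midLng ? 0 : 1);
        return children[idx];
    }
}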
3184How will we find neighboring grids of a given grid? Since only leaf nodes contain a list of
3185locations, we can connect all leaf nodes with a doubly linked list. This way we can iterate
3186forward or backward among the neighboring leaf nodes to find out our desired locations.
3187Another approach for finding adjacent grids would be through parent nodes. We can keep a
3188pointer in each node to access its parent, and since each parent node has pointers to all of its
3189children, we can easily find siblings of a node. We can keep expanding our search for
3190neighboring grids by going up through the parent pointers.
3191Once we have nearby LocationIDs, we can query the backend database to find details about
3192those places.
3193What will be the search workflow? We will first find the node that contains the user’s
3194location. If that node has enough desired places, we can return them to the user. If not, we will
3195keep expanding to the neighboring nodes (either through the parent pointers or doubly linked
3196list) until either we find the required number of places or exhaust our search based on the
3197maximum radius.
3198How much memory will be needed to store the QuadTree? For each Place, if we cache only
3199LocationID and Lat/Long, we would need 12GB to store all places.
320024 * 500M => 12 GB
Since each grid can have a maximum of 500 places, and we have 500M locations, how many total grids will we have?
3203500M / 500 => 1M grids
3204Which means we will have 1M leaf nodes and they will be holding 12GB of location data. A
3205QuadTree with 1M leaf nodes will have approximately 1/3rd internal nodes, and each internal
3206node will have 4 pointers (for its children). If each pointer is 8 bytes, then the memory we
3207need to store all internal nodes would be:
32081M * 1/3 * 4 * 8 = 10 MB
3209So, total memory required to hold the whole QuadTree would be 12.01GB. This can easily fit
3210into a modern-day server.
3211How would we insert a new Place into our system? Whenever a new Place is added by a
3212user, we need to insert it into the databases as well as in the QuadTree. If our tree resides on
3213one server, it is easy to add a new Place, but if the QuadTree is distributed among different
3214servers, first we need to find the grid/server of the new Place and then add it there (discussed
3215in the next section).
32177. Data Partitioning
3218What if we have a huge number of places such that our index does not fit into a single
3219machine’s memory? With 20% growth each year we will reach the memory limit of the server
3220in the future. Also, what if one server cannot serve the desired read traffic? To resolve these
3221issues, we must partition our QuadTree!
3222We will explore two solutions here (both of these partitioning schemes can be applied to
3223databases, too):
3224a. Sharding based on regions: We can divide our places into regions (like zip codes), such
that all places belonging to a region will be stored on a fixed node. To store a place, we will find the server through its region and, similarly, while querying for nearby places, we will ask the region server that contains the user's location. This approach has a couple of issues:
1. What if a region becomes hot? There would be a lot of queries on the server holding that region, making it slow. This will affect the performance of our service.
2. Over time, some regions can end up storing a lot of places compared to others. Hence, maintaining a uniform distribution of places while regions are growing is quite difficult.
3233To recover from these situations, either we have to repartition our data or use consistent
3234hashing.
3235b. Sharding based on LocationID: Our hash function will map each LocationID to a server
3236where we will store that place. While building our QuadTree, we will iterate through all the
3237places and calculate the hash of each LocationID to find a server where it would be stored. To
3238find places near a location, we have to query all servers and each server will return a set of
3239nearby places. A centralized server will aggregate these results to return them to the user.
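A rough sketch of this scheme, under the assumption of a fixed number of QuadTree servers and an illustrative QuadTreeServer interface (neither is prescribed by the text): the hash of a LocationID picks the partition for writes, while a nearby-search fans out to all partitions and merges their results, which is exactly the scatter-gather the centralized aggregator performs.

import java.util.ArrayList;
import java.util.List;

public class LocationIdSharding {
    static final int NUM_QUADTREE_SERVERS = 16;   // assumed cluster size

    // The hash of the LocationID decides which QuadTree server stores the place.
    static int serverFor(long locationId) {
        return (Long.hashCode(locationId) & 0x7fffffff) % NUM_QUADTREE_SERVERS;
    }

    interface QuadTreeServer {
        List<Long> nearby(double lat, double lng, double radiusMiles);
    }

    // Scatter-gather: every partition returns its own nearby places,
    // and the aggregator merges them before responding to the user.
    static List<Long> searchNearby(List<QuadTreeServer> servers,
                                   double lat, double lng, double radiusMiles) {
        List<Long> merged = new ArrayList<>();
        for (QuadTreeServer s : servers) merged.addAll(s.nearby(lat, lng, radiusMiles));
        return merged;
    }
}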
Will we have a different QuadTree structure on different partitions? Yes, this can happen
3241since it is not guaranteed that we will have an equal number of places in any given grid on all
3242partitions. However, we do make sure that all servers have approximately an equal number of
3243Places. This different tree structure on different servers will not cause any issue though, as we
3244will be searching all the neighboring grids within the given radius on all partitions.
3245The remaining part of this chapter assumes that we have partitioned our data based on
3246LocationID.
32488. Replication and Fault Tolerance
Having replicas of QuadTree servers can provide an alternative to data partitioning. To distribute read traffic, we can have replicas of each QuadTree server. We can have a master-slave configuration where the replicas (slaves) will only serve read traffic; all write traffic will first go to the master and then be applied to the slaves. Slaves might not have some recently inserted places (there will be a delay of a few milliseconds), but this could be acceptable.
What will happen when a QuadTree server dies? We can have a secondary replica of each server and, if the primary dies, the secondary can take control after the failover. Both primary and secondary servers will have the same QuadTree structure.
3256What if both primary and secondary servers die at the same time? We have to allocate a
3257new server and rebuild the same QuadTree on it. How can we do that, since we don’t know
3258what places were kept on this server? The brute-force solution would be to iterate through the
3259whole database and filter LocationIDs using our hash function to figure out all the required
3260places that will be stored on this server. This would be inefficient and slow; also, during the
3261time when the server is being rebuilt, we will not be able to serve any query from it, thus
3262missing some places that should have been seen by users.
3263How can we efficiently retrieve a mapping between Places and QuadTree server? We have
3264to build a reverse index that will map all the Places to their QuadTree server. We can have a
3265separate QuadTree Index server that will hold this information. We will need to build a
3266HashMap where the ‘key’ is the QuadTree server number and the ‘value’ is a HashSet
3267containing all the Places being kept on that QuadTree server. We need to store LocationID
and Lat/Long with each place so that the QuadTree servers can rebuild their QuadTrees from this information. Notice that we are keeping the Places' data in a HashSet; this will enable us to add/remove Places from our index quickly. So now, whenever a QuadTree server needs to rebuild itself, it
3271can simply ask the QuadTree Index server for all the Places it needs to store. This approach
3272will surely be quite fast. We should also have a replica of the QuadTree Index server for fault
3274tolerance. If a QuadTree Index server dies, it can always rebuild its index from iterating
3275through the database.
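A minimal sketch of that reverse index, assuming an illustrative PlaceRef type that carries the LocationID and Lat/Long; the QuadTreeIndexServer name and its method signatures are assumptions made for this example, not part of the design text.

import java.util.*;

public class QuadTreeIndexServer {
    static class PlaceRef {
        final long locationId; final double lat, lng;
        PlaceRef(long id, double lat, double lng) { locationId = id; this.lat = lat; this.lng = lng; }
        @Override public boolean equals(Object o) {
            return o instanceof PlaceRef && ((PlaceRef) o).locationId == locationId;
        }
        @Override public int hashCode() { return Long.hashCode(locationId); }
    }

    // key: QuadTree server number, value: the Places kept on that server
    private final Map<Integer, Set<PlaceRef>> index = new HashMap<>();

    void addPlace(int serverId, PlaceRef place) {
        index.computeIfAbsent(serverId, k -> new HashSet<>()).add(place);
    }

    void removePlace(int serverId, PlaceRef place) {
        Set<PlaceRef> set = index.get(serverId);
        if (set != null) set.remove(place);
    }

    // Called by a QuadTree server that is rebuilding itself after a failure.
    Set<PlaceRef> placesFor(int serverId) {
        return index.getOrDefault(serverId, Collections.emptySet());
    }
}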
32769. Cache
To deal with hot Places, we can introduce a cache in front of our database. We can use an off-the-shelf solution like Memcache, which can store all data about hot places. Application
3278servers, before hitting the backend database, can quickly check if the cache has that Place.
3279Based on clients’ usage pattern, we can adjust how many cache servers we need. For cache
3280eviction policy, Least Recently Used (LRU) seems suitable for our system.
328110. Load Balancing (LB)
We can add an LB layer at two places in our system: 1) between clients and application servers and 2) between application servers and backend servers. Initially, a simple Round Robin approach can be adopted that distributes all incoming requests equally among backend servers. This LB is simple to implement and does not introduce any overhead. Another benefit of this approach is that if a server is dead, the load balancer will take it out of the rotation and will stop sending any traffic to it.
A problem with Round Robin LB is that it won't take server load into consideration. If a server is overloaded or slow, the load balancer will not stop sending new requests to that server. To handle this, a more intelligent LB solution would be needed that periodically queries backend servers about their load and adjusts traffic based on that.
329211. Ranking
3293How about if we want to rank the search results not just by proximity but also by popularity or
3294relevance?
3295How can we return most popular places within a given radius? Let’s assume we keep track
of the overall popularity of each place. An aggregated number can represent this popularity in our system, e.g., how many stars a place gets out of ten (this would be an average of the different ratings given by users). We will store this number in the database as well as in the
3299QuadTree. While searching for the top 100 places within a given radius, we can ask each
3300partition of the QuadTree to return the top 100 places with maximum popularity. Then the
3301aggregator server can determine the top 100 places among all the places returned by different
3302partitions.
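A small sketch of that aggregation step, assuming each partition already returns its own top places: a bounded min-heap keeps the overall top k by popularity. The Place class, its field names, and the choice of a PriorityQueue are illustrative, not mandated by the design.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TopPlacesAggregator {
    static class Place {
        long locationId;
        double popularity;   // e.g., the aggregated star rating mentioned above
    }

    // Merge the per-partition results and keep only the k most popular places.
    static List<Place> topK(List<List<Place>> perPartitionResults, int k) {
        PriorityQueue<Place> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Place p) -> p.popularity));
        for (List<Place> partition : perPartitionResults)
            for (Place p : partition) {
                heap.offer(p);
                if (heap.size() > k) heap.poll();   // drop the least popular seen so far
            }
        List<Place> top = new ArrayList<>(heap);
        top.sort(Comparator.comparingDouble((Place p) -> p.popularity).reversed());
        return top;
    }
}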
Remember that we didn't build our system to update a place's data frequently. With this design, how can we modify the popularity of a place in our QuadTree? Although we can search for a place and update its popularity in the QuadTree, that would take a lot of resources and could affect search requests and system throughput. Assuming the popularity of a place is not expected to be reflected in the system within a few hours, we can decide to update it once or twice a day, especially when the load on the system is at its minimum.
3310Our next problem, Designing Uber backend, discusses dynamic updates of the QuadTree in
3311detail.
3313Designing Uber backend
3314Let's design a ride-sharing service like Uber, which connects passengers who need a ride with
3315drivers who have a car. Similar Services: Lyft, Didi, Via, Sidecar, etc. Difficulty level: Hard
3316Prerequisite: Designing Yelp
33171. What is Uber?
3318Uber enables its customers to book drivers for taxi rides. Uber drivers use their personal cars
3319to drive customers around. Both customers and drivers communicate with each other through
3320their smartphones using the Uber app.
33212. Requirements and Goals of the System
3322Let’s start with building a simpler version of Uber.
3323There are two types of users in our system: 1) Drivers 2) Customers.
3324• Drivers need to regularly notify the service about their current location and their
availability to pick up passengers.
3326• Passengers get to see all the nearby available drivers.
• Customers can request a ride; nearby drivers are notified that a customer is ready to be
3328picked up.
3329• Once a driver and a customer accept a ride, they can constantly see each other’s current
3330location until the trip finishes.
3331• Upon reaching the destination, the driver marks the journey complete to become
3332available for the next ride.
33333. Capacity Estimation and Constraints
3334• Let’s assume we have 300M customers and 1M drivers with 1M daily active customers
3335and 500K daily active drivers.
3336• Let’s assume 1M daily rides.
3337• Let’s assume that all active drivers notify their current location every three seconds.
3338• Once a customer puts in a request for a ride, the system should be able to contact
3339drivers in real-time.
33404. Basic System Design and Algorithm
3341We will take the solution discussed in Designing Yelp and modify it to make it work for the
3342above-mentioned “Uber” use cases. The biggest difference we have is that our QuadTree was
3343not built keeping in mind that there would be frequent updates to it. So, we have two issues
3344with our Dynamic Grid solution:
3346• Since all active drivers are reporting their locations every three seconds, we need to
3347update our data structures to reflect that. If we have to update the QuadTree for every
3348change in the driver’s position, it will take a lot of time and resources. To update a
3349driver to its new location, we must find the right grid based on the driver’s previous
3350location. If the new position does not belong to the current grid, we have to remove the
driver from the current grid and move/reinsert the driver into the correct grid. After this
3352move, if the new grid reaches the maximum limit of drivers, we have to repartition it.
3353• We need to have a quick mechanism to propagate the current location of all the nearby
3354drivers to any active customer in that area. Also, when a ride is in progress, our system
3355needs to notify both the driver and passenger about the current location of the car.
3356Although our QuadTree helps us find nearby drivers quickly, a fast update in the tree is not
3357guaranteed.
3358Do we need to modify our QuadTree every time a driver reports their location? If we don’t
3359update our QuadTree with every update from the driver, it will have some old data and will
3360not reflect the current location of drivers correctly. If you recall, our purpose of building the
QuadTree was to find nearby drivers (or places) efficiently. Since all active drivers report their location every three seconds, there will be far more updates happening to our tree than queries for nearby drivers. So, what if we keep the latest position reported by all drivers
3364in a hash table and update our QuadTree a little less frequently? Let’s assume we guarantee
3365that a driver’s current location will be reflected in the QuadTree within 15 seconds.
3366Meanwhile, we will maintain a hash table that will store the current location reported by
3367drivers; let’s call this DriverLocationHT.
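A minimal sketch of DriverLocationHT under these assumptions: location updates land in an in-memory hash table every few seconds, and a separate periodic job pushes the accumulated positions into the QuadTree, which keeps the 15-second freshness target. The class name, the QuadTreeClient interface, and the method names are illustrative.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DriverLocationHT {
    static class DriverLocation {
        double oldLat, oldLng, newLat, newLng;   // a brand-new driver starts at (0, 0)
    }

    private final Map<Integer, DriverLocation> locations = new ConcurrentHashMap<>();

    // Called roughly every 3 seconds for each active driver.
    public void update(int driverId, double lat, double lng) {
        DriverLocation loc = locations.computeIfAbsent(driverId, id -> new DriverLocation());
        loc.oldLat = loc.newLat; loc.oldLng = loc.newLng;
        loc.newLat = lat;        loc.newLng = lng;
    }

    // Invoked periodically (e.g., every 10-15 seconds) by a background job
    // to move drivers to their latest grids in the QuadTree.
    public void refreshQuadTree(QuadTreeClient quadTree) {
        locations.forEach((driverId, loc) ->
            quadTree.move(driverId, loc.oldLat, loc.oldLng, loc.newLat, loc.newLng));
    }

    public interface QuadTreeClient {
        void move(int driverId, double oldLat, double oldLng, double newLat, double newLng);
    }
}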
How much memory do we need for DriverLocationHT? We need to store the DriverID, along with the driver's present and old locations, in the hash table. So, we need a total of 35 bytes to store one record:
33701. DriverID (3 bytes - 1 million drivers)
33712. Old latitude (8 bytes)
33723. Old longitude (8 bytes)
33734. New latitude (8 bytes)
5. New longitude (8 bytes)
Total = 35 bytes
3375If we have 1 million total drivers, we need the following memory (ignoring hash table
3376overhead):
33771 million * 35 bytes => 35 MB
3378How much bandwidth will our service consume to receive location updates from all
3379drivers? If we get DriverID and their location, it will be (3+16 => 19 bytes). If we receive this
3380information every three seconds from 500K daily active drivers, we will be getting 9.5MB per
3381three seconds.
Do we need to distribute DriverLocationHT onto multiple servers? Although our memory and bandwidth requirements don't require this, since all this information can easily be stored on one server, we should still distribute DriverLocationHT onto multiple servers for scalability, performance, and fault tolerance. We can distribute based on the DriverID to make the
3387distribution completely random. Let’s call the machines holding DriverLocationHT the Driver
3388Location server. Other than storing the driver’s location, each of these servers will do two
3389things:
33901. As soon as the server receives an update for a driver’s location, they will broadcast that
3391information to all the interested customers.
33922. The server needs to notify the respective QuadTree server to refresh the driver’s
3393location. As discussed above, this can happen every 10 seconds.
3394How can we efficiently broadcast the driver’s location to customers? We can have a Push
3395Model where the server will push the positions to all the relevant users. We can have a
3396dedicated Notification Service that can broadcast the current location of drivers to all the
3397interested customers. We can build our Notification service on a publisher/subscriber model.
3398When a customer opens the Uber app on their cell phone, they query the server to find nearby
3399drivers. On the server side, before returning the list of drivers to the customer, we will
3400subscribe the customer for all the updates from those drivers. We can maintain a list of
3401customers (subscribers) interested in knowing the location of a driver and, whenever we have
3402an update in DriverLocationHT for that driver, we can broadcast the current location of the
3403driver to all subscribed customers. This way, our system makes sure that we always show the
3404driver’s current position to the customer.
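A sketch of that publisher/subscriber bookkeeping, assuming an illustrative CustomerChannel interface that hides the actual push transport (long polling, push notifications, and so on); the class and method names are not taken from the text.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;

public class DriverLocationPublisher {
    public interface CustomerChannel { void push(int driverId, double lat, double lng); }

    // key: DriverID, value: channels of the customers watching that driver
    private final Map<Integer, Set<CustomerChannel>> subscribers = new ConcurrentHashMap<>();

    // Called when we return a driver in a customer's nearby-drivers result.
    public void subscribe(int driverId, CustomerChannel customer) {
        subscribers.computeIfAbsent(driverId, id -> new CopyOnWriteArraySet<>()).add(customer);
    }

    public void unsubscribe(int driverId, CustomerChannel customer) {
        Set<CustomerChannel> set = subscribers.get(driverId);
        if (set != null) set.remove(customer);
    }

    // Called whenever DriverLocationHT records a new position for a driver.
    public void onLocationUpdate(int driverId, double lat, double lng) {
        Set<CustomerChannel> watchers = subscribers.get(driverId);
        if (watchers == null) return;
        for (CustomerChannel c : watchers) c.push(driverId, lat, lng);
    }
}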
3405How much memory will we need to store all these subscriptions? As we have estimated
3406above, we will have 1M daily active customers and 500K daily active drivers. On average let’s
3407assume that five customers subscribe to one driver. Let’s assume we store all this information
3408in a hash table so that we can update it efficiently. We need to store driver and customer IDs
3409to maintain the subscriptions. Assuming we will need 3 bytes for DriverID and 8 bytes for
3410CustomerID, we will need 21MB of memory.
3411(500K * 3) + (500K * 5 * 8 ) ~= 21 MB
3412How much bandwidth will we need to broadcast the driver’s location to customers? For
3413every active driver, we have five subscribers, so the total subscribers we have:
34145 * 500K => 2.5M
3415To all these customers we need to send DriverID (3 bytes) and their location (16 bytes) every
3416second, so, we need the following bandwidth:
34172.5M * 19 bytes => 47.5 MB/s
3418How can we efficiently implement Notification service? We can either use HTTP long
3419polling or push notifications.
3420How will the new publishers/drivers get added for a current customer? As we have
3421proposed above, customers will be subscribed to nearby drivers when they open the Uber app
for the first time. But what will happen when a new driver enters the area the customer is looking at? To add a new customer/driver subscription dynamically, we need to keep track of the area the customer is watching. This will make our solution complicated; how about if, instead of
3426pushing this information, clients pull it from the server?
3427How about if clients pull information about nearby drivers from the server? Clients can
3428send their current location, and the server will find all the nearby drivers from the QuadTree
3429to return them to the client. Upon receiving this information, the client can update their
3430screen to reflect the current positions of the drivers. Clients can query every five seconds to
3431limit the number of round trips to the server. This solution looks simpler compared to the
3432push model described above.
3433Do we need to repartition a grid as soon as it reaches the maximum limit? We can have a
3434cushion to let each grid grow a little bigger beyond the limit before we decide to partition it.
3435Let’s say our grids can grow/shrink an extra 10% before we partition/merge them. This should
3436decrease the load for a grid partition or merge on high traffic grids.
How would the “Request Ride” use case work? (A sketch of this dispatch flow follows the list below.)
34381. The customer will put a request for a ride.
2. One of the Aggregator servers will take the request and ask the QuadTree servers to return
3440nearby drivers.
34413. The Aggregator server collects all the results and sorts them by ratings.
4. The Aggregator server will send a notification to the top (say three) drivers simultaneously; whichever driver accepts the request first will be assigned the ride. The
3444other drivers will receive a cancellation request. If none of the three drivers respond,
3445the Aggregator will request a ride from the next three drivers from the list.
34465. Once a driver accepts a request, the customer is notified.
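A simplified sketch of steps 3-5 above, assuming a hypothetical DriverGateway interface for notifying drivers. A real system would contact each batch of three drivers in parallel with a timeout; this sketch offers them one by one only to stay short.

import java.util.Comparator;
import java.util.List;

public class RideDispatcher {
    public interface DriverGateway {
        boolean offerRide(int driverId, long rideId);   // true if the driver accepts in time
        void cancelOffer(int driverId, long rideId);
    }

    static class Candidate { int driverId; double rating; }

    // nearbyDrivers comes from the QuadTree servers via the Aggregator server.
    public static int dispatch(long rideId, List<Candidate> nearbyDrivers, DriverGateway gateway) {
        nearbyDrivers.sort(Comparator.comparingDouble((Candidate c) -> c.rating).reversed());
        for (int i = 0; i < nearbyDrivers.size(); i += 3) {
            List<Candidate> batch = nearbyDrivers.subList(i, Math.min(i + 3, nearbyDrivers.size()));
            for (Candidate c : batch) {
                if (gateway.offerRide(c.driverId, rideId)) {
                    // Cancel the offer for the other drivers in this batch.
                    for (Candidate other : batch)
                        if (other.driverId != c.driverId) gateway.cancelOffer(other.driverId, rideId);
                    return c.driverId;   // this driver gets the ride
                }
            }
            // None of these three drivers accepted; fall through to the next three.
        }
        return -1;   // no driver accepted the request
    }
}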
34485. Fault Tolerance and Replication
3449What if a Driver Location server or Notification server dies? We would need replicas of
3450these servers, so that if the primary dies the secondary can take control. Also, we can store this
3451data in some persistent storage like SSDs that can provide fast IOs; this will ensure that if
3452both primary and secondary servers die we can recover the data from the persistent storage.
34536. Ranking
3454How about if we want to rank the search results not just by proximity but also by popularity or
3455relevance?
3456How can we return top rated drivers within a given radius? Let’s assume we keep track of
3457the overall ratings of each driver in our database and QuadTree. An aggregated number can
3458represent this popularity in our system, e.g., how many stars does a driver get out of ten?
3459While searching for the top 10 drivers within a given radius, we can ask each partition of the
3460QuadTree to return the top 10 drivers with a maximum rating. The aggregator server can then
3461determine the top 10 drivers among all the drivers returned by different partitions.
34627. Advanced Issues
34631. How will we handle clients on slow and disconnecting networks?
34642. What if a client gets disconnected when they are a part of a ride? How will we handle
3465billing in such a scenario?
34663. How about if clients pull all the information, compared to servers always pushing it?
3468Design Ticketmaster (*New*)
3469Let's design an online ticketing system that sells movie tickets like Ticketmaster or
3470BookMyShow. Similar Services: bookmyshow.com, ticketmaster.com Difficulty Level: Hard
34711. What is an online movie ticket booking system?
3472A movie ticket booking system provides its customers the ability to purchase theatre seats
3473online. E-ticketing systems allow the customers to browse through movies currently being
3474played and to book seats, anywhere anytime.
34752. Requirements and Goals of the System
3476Our ticket booking service should meet the following requirements:
3477Functional Requirements:
34781. Our ticket booking service should be able to list different cities where its affiliate
3479cinemas are located.
34802. Once the user selects the city, the service should display the movies released in that
3481particular city.
34823. Once the user selects a movie, the service should display the cinemas running that
3483movie and its available show times.
34844. The user should be able to choose a show at a particular cinema and book their tickets.
34855. The service should be able to show the user the seating arrangement of the cinema hall.
3486The user should be able to select multiple seats according to their preference.
34876. The user should be able to distinguish available seats from booked ones.
34887. Users should be able to put a hold on the seats for five minutes before they make a
3489payment to finalize the booking.
34908. The user should be able to wait if there is a chance that the seats might become
3491available, e.g., when holds by other users expire.
34929. Waiting customers should be serviced in a fair, first come, first serve manner.
3493Non-Functional Requirements:
34941. The system would need to be highly concurrent. There will be multiple booking requests
3495for the same seat at any particular point in time. The service should handle this
3496gracefully and fairly.
2. The core function of the service is ticket booking, which involves financial transactions. This means that the system should be secure and the database ACID compliant.
34993. Some Design Considerations
35001. For simplicity, let’s assume our service does not require any user authentication.
35012. The system will not handle partial ticket orders. Either user gets all the tickets they
3502want or they get nothing.
35043. Fairness is mandatory for the system.
35054. To stop system abuse, we can restrict users from booking more than ten seats at a time.
35065. We can assume that traffic would spike on popular/much-awaited movie releases and
3507the seats would fill up pretty fast. The system should be scalable and highly available to
3508keep up with the surge in traffic.
35094. Capacity Estimation
3510Traffic estimates: Let’s assume that our service has 3 billion page views per month and sells
351110 million tickets a month.
Storage estimates: Let's assume that we have 500 cities and, on average, each city has ten cinemas. Each cinema has 2,000 seats and, on average, runs two shows every day.
3515Let’s assume each seat booking needs 50 bytes (IDs, NumberOfSeats, ShowID, MovieID,
3516SeatNumbers, SeatStatus, Timestamp, etc.) to store in the database. We would also need to
3517store information about movies and cinemas; let’s assume it’ll take 50 bytes. So, to store all
3518the data about all shows of all cinemas of all cities for a day:
3519500 cities * 10 cinemas * 2000 seats * 2 shows * (50+50) bytes = 2GB / day
3520To store five years of this data, we would need around 3.6TB.
35215. System APIs
3522We can have SOAP or REST APIs to expose the functionality of our service. The following
3523could be the definition of the APIs to search movie shows and reserve seats.
SearchMovies(api_dev_key, keyword, city, lat_long, radius, start_datetime, end_datetime, postal_code, includeSpellcheck, results_per_page, sorting_order)
3527Parameters:
3528api_dev_key (string): The API developer key of a registered account. This will be used to,
3529among other things, throttle users based on their allocated quota.
3530keyword (string): Keyword to search on.
3531city (string): City to filter movies by.
lat_long (string): Latitude and longitude to filter by.
radius (number): Radius of the area in which we want to search for shows.
3534start_datetime (string): Filter movies with a starting datetime.
3535end_datetime (string): Filter movies with an ending datetime.
3536postal_code (string): Filter movies by postal code / zipcode.
3537includeSpellcheck (Enum: “yes” or “no”): Yes, to include spell check suggestions in the
3538response.
3539results_per_page (number): Number of results to return per page. Maximum is 30.
sorting_order (string): Sorting order of the search result. Some allowable values: ‘name,asc’, ‘name,desc’, ‘date,asc’, ‘date,desc’, ‘distance,asc’, ‘name,date,asc’, ‘name,date,desc’, ‘date,name,asc’, ‘date,name,desc’.
3544Returns: (JSON)
3545Here is a sample list of movies and their shows:
[
  {
    "MovieID": 1,
    "ShowID": 1,
    "Title": "Cars 2",
    "Description": "About cars",
    "Duration": 120,
    "Genre": "Animation",
    "Language": "English",
    "ReleaseDate": "8th Oct. 2014",
    "Country": "USA",
    "StartTime": "14:00",
    "EndTime": "16:00",
    "Seats":
    [
      {
        "Type": "Regular",
        "Price": 14.99,
        "Status": "Almost Full"
      },
      {
        "Type": "Premium",
        "Price": 24.99,
        "Status": "Available"
      }
    ]
  },
  ...
]
3574ReserveSeats(api_dev_key, session_id, movie_id, show_id, seats_to_reserve[])
3575Parameters:
3576api_dev_key (string): same as above
3577session_id (string): User’s session ID to track this reservation. Once the reservation time
3578expires, user’s reservation on the server will be removed using this ID.
3579movie_id (string): Movie to reserve.
3580show_id (string): Show to reserve.
seats_to_reserve (number[]): An array containing the seat IDs to reserve.
3582Returns: (JSON)
Returns the status of the reservation, which would be one of the following: 1) “Reservation Successful”, 2) “Reservation Failed - Show Full”, 3) “Reservation Failed - Retry, as other users are holding reserved seats”.
35876. Database Design
3588Here are a few observations about the data we are going to store:
35891. Each City can have multiple Cinemas.
35902. Each Cinema will have multiple halls.
35913. Each Movie will have many Shows and each Show will have multiple Bookings.
35924. A user can have multiple bookings.
35937. High Level Design
3594At a high-level, our web servers will manage users’ sessions and application servers will
3595handle all the ticket management, storing data in the databases as well as working with the
3596cache servers to process reservations.
35988. Detailed Component Design
3599First, let’s try to build our service assuming it is being served from a single server.
3600Ticket Booking Workflow: The following would be a typical ticket booking workflow:
36011. The user searches for a movie.
36022. The user selects a movie.
36033. The user is shown the available shows of the movie.
36044. The user selects a show.
36055. The user selects the number of seats to be reserved.
36066. If the required number of seats are available, the user is shown a map of the theater to
3607select seats. If not, the user is taken to ‘step 8’ below.
36087. Once the user selects the seat, the system will try to reserve those selected seats.
36098. If seats can’t be reserved, we have the following options:
3610• Show is full; the user is shown the error message.
3611• The seats the user wants to reserve are no longer available, but there are other seats
3612available, so the user is taken back to the theater map to choose different seats.
• There are no seats available to reserve, but not all the seats are booked yet, as there are
3614some seats that other users are holding in the reservation pool and have not booked yet.
3615The user will be taken to a waiting page where they can wait until the required seats get
3616freed from the reservation pool. This waiting could result in the following options:
3617o If the required number of seats become available, the user is taken to the theater
3618map page where they can choose seats.
o While waiting, if all seats get booked or there are fewer seats in the reservation pool than the user intends to book, the user is shown an error message.
3621o User cancels the waiting and is taken back to the movie search page.
o At maximum, a user can wait one hour; after that, the user's session expires and they are taken back to the movie search page.
36259. If seats are reserved successfully, the user has five minutes to pay for the reservation.
After payment, the booking is marked complete. If the user is not able to pay within five
3627minutes, all their reserved seats are freed to become available to other users.
How would the server keep track of all the active reservations that haven't been booked yet? And how would the server keep track of all the waiting customers?
We need two daemon services: one to keep track of all active reservations and remove any expired reservations from the system; let's call it ActiveReservationsService. The other service would keep track of all the waiting user requests and, as soon as the required number of seats becomes available, it will notify the longest-waiting user to choose the seats; let's call it WaitingUsersService.
3639a. ActiveReservationsService
3640We can keep all the reservations of a ‘show’ in memory in a data structure similar to Linked
3641HashMap or a TreeMap in addition to keeping all the data in the database. We will need a
Linked HashMap kind of data structure that allows us to jump to any reservation to remove it when the booking is complete. Also, since we will have an expiry time associated with each reservation, the head of the Linked HashMap will always point to the oldest reservation record, so that the reservation can be expired when the timeout is reached.
3646To store every reservation for every show, we can have a HashTable where the ‘key’ would be
3647‘ShowID’ and the ‘value’ would be the Linked HashMap containing ‘BookingID’ and creation
3648‘Timestamp’.
3649In the database, we will store the reservation in the ‘Booking’ table and the expiry time will be
3650in the Timestamp column. The ‘Status’ field will have a value of ‘Reserved (1)’ and, as soon as
3651a booking is complete, the system will update the ‘Status’ to ‘Booked (2)’ and remove the
3652reservation record from the Linked HashMap of the relevant show. When the reservation is
3653expired, we can either remove it from the Booking table or mark it ‘Expired (3)’ in addition to
3654removing it from memory.
3655ActiveReservationsService will also work with the external financial service to process user
3656payments. Whenever a booking is completed, or a reservation gets expired,
3657WaitingUsersService will get a signal so that any waiting customer can be served.
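A condensed sketch of ActiveReservationsService along these lines: one insertion-ordered LinkedHashMap per show, keyed by BookingID with the creation timestamp as the value, so the expiry scan can always start at the head. The timeout constant, the method names, and the coarse synchronization are assumptions made to keep the example short; the database updates and the signal to WaitingUsersService are only indicated in comments.

import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ActiveReservationsService {
    static final long RESERVATION_TIMEOUT_MS = 5 * 60 * 1000;   // five minutes

    // key: ShowID, value: BookingID -> creation timestamp, in insertion order
    private final Map<Long, LinkedHashMap<Long, Long>> reservations = new ConcurrentHashMap<>();

    public synchronized void addReservation(long showId, long bookingId) {
        reservations.computeIfAbsent(showId, id -> new LinkedHashMap<>())
                    .put(bookingId, System.currentTimeMillis());
    }

    // Called when the user completes payment; the Booking row is also marked Booked (2).
    public synchronized void completeBooking(long showId, long bookingId) {
        LinkedHashMap<Long, Long> perShow = reservations.get(showId);
        if (perShow != null) perShow.remove(bookingId);
    }

    // Run periodically by a daemon thread: expire reservations starting from the head.
    public synchronized void expireOldReservations(long showId) {
        LinkedHashMap<Long, Long> perShow = reservations.get(showId);
        if (perShow == null) return;
        long now = System.currentTimeMillis();
        Iterator<Map.Entry<Long, Long>> it = perShow.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Long> oldest = it.next();
            if (now - oldest.getValue() < RESERVATION_TIMEOUT_MS) break;   // the rest are newer
            it.remove();   // also mark Expired (3) in the Booking table and signal WaitingUsersService
        }
    }
}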
3658b. WaitingUsersService
3659Just like ActiveReservationsService, we can keep all the waiting users of a show in memory in
3660a Linked HashMap or a TreeMap. We need a data structure similar to Linked HashMap so
3661that we can jump to any user to remove them from the HashMap when the user cancels their
3662request. Also, since we are serving in a first-come-first-serve manner, the head of the Linked
3663HashMap would always be pointing to the longest waiting user, so that whenever seats
3664become available, we can serve users in a fair manner.
3665We will have a HashTable to store all the waiting users for every Show. The ‘key’ would be
‘ShowID’ and the ‘value’ would be a Linked HashMap containing ‘UserIDs’ and their wait-start-time.
Clients can use Long Polling to keep themselves updated on their reservation status.
3668Whenever seats become available, the server can use this request to notify the user.
3669Reservation Expiration
3670On the server, ActiveReservationsService keeps track of expiry (based on reservation time) of
3671active reservations. As the client will be shown a timer (for the expiration time), which could
3673be a little out of sync with the server, we can add a buffer of five seconds on the server to
3674safeguard from a broken experience, such that the client never times out after the server,
3675preventing a successful purchase.
36769. Concurrency
How do we handle concurrency, such that no two users are able to book the same seat? We can use transactions in SQL databases to avoid any clashes. For example, if we are using SQL Server, we can utilize Transaction Isolation Levels to lock the rows before we update them. Here is the sample code:
3681SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
3682BEGIN TRANSACTION;
3683 -- Suppose we intend to reserve three seats (IDs: 54, 55, 56) for ShowID=99
    Select * From Show_Seat where ShowID=99 AND ShowSeatID in (54, 55, 56) AND Status=0 -- 0 means free
    -- if the number of rows returned by the above statement is three, update Show_Seat and
    -- Booking and return success; otherwise, return failure to the user.
3688 update Show_Seat ...
3689 update Booking ...
3690COMMIT TRANSACTION;
‘Serializable’ is the highest isolation level and guarantees safety from dirty, non-repeatable, and phantom reads. One thing to note here: within a transaction, if we read rows, we acquire a lock on them so that they can't be updated by anyone else.
3694Once the above database transaction is successful, we can start tracking the reservation in
3695ActiveReservationService.
369610. Fault Tolerance
3697What happens when ActiveReservationsService or WaitingUsersService crashes?
3698Whenever ActiveReservationsService crashes, we can read all the active reservations from the
3699‘Booking’ table. Remember that we keep the ‘Status’ column as ‘Reserved (1)’ until a
3700reservation gets booked. Another option is to have a master-slave configuration so that, when
3701the master crashes, the slave can take over. We are not storing the waiting users in the
3702database, so, when WaitingUsersService crashes, we don’t have any means to recover that
3703data unless we have a master-slave setup.
3704Similarly, we’ll have a master-slave setup for databases to make them fault tolerant.
370611. Data Partitioning
3707Database partitioning: If we partition by ‘MovieID’, then all the Shows of a movie will be on a
3708single server. For a very hot movie, this could cause a lot of load on that server. A better
3709approach would be to partition based on ShowID; this way, the load gets distributed among
3710different servers.
ActiveReservationsService and WaitingUsersService partitioning: Our web servers will manage all the active users' sessions and handle all the communication with the users. We can use Consistent Hashing to allocate application servers for both ActiveReservationsService and WaitingUsersService based upon the ‘ShowID’. This way, all reservations and waiting users of a particular show will be handled by a certain set of servers. Let's assume that, for load balancing, our Consistent Hashing allocates three servers for any Show (a minimal consistent-hashing sketch follows the two lists below). Whenever a reservation expires, the server holding that reservation will do the following things:
37181. Update database to remove the Booking (or mark it expired) and update the seats’
3719Status in ‘Show_Seats’ table.
37202. Remove the reservation from the Linked HashMap.
37213. Notify the user that their reservation has expired.
37224. Broadcast a message to all WaitingUserService servers that are holding waiting users of
that Show to figure out the longest-waiting user. The Consistent Hashing scheme will tell us which servers are holding these users.
37255. Send a message to the WaitingUserService server holding the longest waiting user to
3726process their request if required seats have become available.
Whenever a reservation is successful, the following things will happen:
37281. The server holding that reservation sends a message to all servers holding the waiting
3729users of that Show, so that those servers can expire all the waiting users that need more
3730seats than the available seats.
37312. Upon receiving the above message, all servers holding the waiting users will query the
3732database to find how many free seats are available now. A database cache would greatly
3733help here to run this query only once.
37343. Expire all waiting users who want to reserve more seats than the available seats. For
3735this, WaitingUserService has to iterate through the Linked HashMap of all the waiting
3736users.
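For completeness, here is the minimal consistent-hashing sketch referenced above: it maps a ShowID to the set of servers responsible for its reservations and waiting users. The virtual-node count, the MD5-based hash, and the class name are illustrative choices; any reasonably uniform hash function would do.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class ShowRing {
    private static final int VIRTUAL_NODES = 100;   // smooths out the distribution
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public void addServer(String server) {
        for (int v = 0; v < VIRTUAL_NODES; v++) ring.put(hash(server + "#" + v), server);
    }

    // Return the first 'count' distinct servers clockwise from hash(showId).
    public List<String> serversFor(long showId, int count) {
        List<String> result = new ArrayList<>();
        if (ring.isEmpty()) return result;
        SortedMap<Long, String> tail = ring.tailMap(hash(Long.toString(showId)));
        for (String server : tail.values()) {
            if (!result.contains(server)) result.add(server);
            if (result.size() == count) return result;
        }
        for (String server : ring.values()) {           // wrap around the ring
            if (!result.contains(server)) result.add(server);
            if (result.size() == count) break;
        }
        return result;
    }

    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
            return ((long) (d[0] & 0xff) << 24) | ((d[1] & 0xff) << 16) | ((d[2] & 0xff) << 8) | (d[3] & 0xff);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}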
3738System Design Basics
3739Whenever we are designing a large system, we need to consider a few things:
37401. What are the different architectural pieces that can be used?
37412. How do these pieces work with each other?
37423. How can we best utilize these pieces: what are the right tradeoffs?
3743Investing in scaling before it is needed is generally not a smart business proposition; however,
3744some forethought into the design can save valuable time and resources in the future. In the
3745following chapters, we will try to define some of the core building blocks of scalable systems.
Familiarizing yourself with these concepts will greatly help in understanding distributed systems. In the next section, we will go through Consistent Hashing, CAP Theorem, Load
3748Balancing, Caching, Data Partitioning, Indexes, Proxies, Queues, Replication, and choosing
3749between SQL vs. NoSQL.
3750Let’s start with the Key Characteristics of Distributed Systems.
3752Key Characteristics of Distributed Systems
3753Key characteristics of a distributed system include Scalability, Reliability, Availability,
3754Efficiency, and Manageability. Let’s briefly review them:
3755Scalability
3756Scalability is the capability of a system, process, or a network to grow and manage increased
3757demand. Any distributed system that can continuously evolve in order to support the growing
3758amount of work is considered to be scalable.
A system may have to scale for many reasons, such as increased data volume or an increased amount of work (e.g., number of transactions). A scalable system should achieve this scaling without performance loss.
3762Generally, the performance of a system, although designed (or claimed) to be scalable,
3763declines with the system size due to the management or environment cost. For instance,
3764network speed may become slower because machines tend to be far apart from one another.
3765More generally, some tasks may not be distributed, either because of their inherent atomic
3766nature or because of some flaw in the system design. At some point, such tasks would limit the
3767speed-up obtained by distribution. A scalable architecture avoids this situation and attempts
3768to balance the load on all the participating nodes evenly.
3769Horizontal vs. Vertical Scaling: Horizontal scaling means that you scale by adding more
3770servers into your pool of resources whereas Vertical scaling means that you scale by adding
3771more power (CPU, RAM, Storage, etc.) to an existing server.
3772With horizontal-scaling it is often easier to scale dynamically by adding more machines into
3773the existing pool; Vertical-scaling is usually limited to the capacity of a single server and
3774scaling beyond that capacity often involves downtime and comes with an upper limit.
3775Good examples of horizontal scaling are Cassandra and MongoDB as they both provide an
3776easy way to scale horizontally by adding more machines to meet growing needs. Similarly, a
3777good example of vertical scaling is MySQL as it allows for an easy way to scale vertically by
3778switching from smaller to bigger machines. However, this process often involves downtime.
3779Vertical scaling vs. Horizontal scaling
3780Reliability
By definition, reliability is the probability that a system will not fail in a given period. In simple terms,
3782a distributed system is considered reliable if it keeps delivering its services even when one or
3783several of its software or hardware components fail. Reliability represents one of the main
3784characteristics of any distributed system, since in such systems any failing machine can
3785always be replaced by another healthy one, ensuring the completion of the requested task.
Take the example of a large electronic commerce store (like Amazon), where one of the primary requirements is that any user transaction should never be canceled due to a failure of
3789the machine that is running that transaction. For instance, if a user has added an item to their
3790shopping cart, the system is expected not to lose it. A reliable distributed system achieves this
3791through redundancy of both the software components and data. If the server carrying the
3792user’s shopping cart fails, another server that has the exact replica of the shopping cart should
3793replace it.
3794Obviously, redundancy has a cost and a reliable system has to pay that to achieve such
3795resilience for services by eliminating every single point of failure.
3796Availability
3797By definition, availability is the time a system remains operational to perform its required
3798function in a specific period. It is a simple measure of the percentage of time that a system,
3799service, or a machine remains operational under normal conditions. An aircraft that can be
3800flown for many hours a month without much downtime can be said to have a high availability.
3801Availability takes into account maintainability, repair time, spares availability, and other
3802logistics considerations. If an aircraft is down for maintenance, it is considered not available
3803during that time.
3804Reliability is availability over time considering the full range of possible real-world conditions
3805that can occur. An aircraft that can make it through any possible weather safely is more
3806reliable than one that has vulnerabilities to possible conditions.
3807Reliability Vs. Availability
3808If a system is reliable, it is available. However, if it is available, it is not necessarily reliable. In
3809other words, high reliability contributes to high availability, but it is possible to achieve a high
3810availability even with an unreliable product by minimizing repair time and ensuring that
3811spares are always available when they are needed. Let’s take the example of an online retail
3812store that has 99.99% availability for the first two years after its launch. However, the system
3813was launched without any information security testing. The customers are happy with the
3814system, but they don’t realize that it isn’t very reliable as it is vulnerable to likely risks. In the
3815third year, the system experiences a series of information security incidents that suddenly
3816result in extremely low availability for extended periods of time. This results in reputational
3817and financial damage to the customers.
3818Efficiency
3819To understand how to measure the efficiency of a distributed system, let’s assume we have an
3820operation that runs in a distributed manner and delivers a set of items as result. Two standard
3821measures of its efficiency are the response time (or latency) that denotes the delay to obtain
3822the first item and the throughput (or bandwidth) which denotes the number of items
3823delivered in a given time unit (e.g., a second). The two measures correspond to the following
3824unit costs:
3826• Number of messages globally sent by the nodes of the system regardless of the message
3827size.
3828• Size of messages representing the volume of data exchanges.
3829The complexity of operations supported by distributed data structures (e.g., searching for a
3830specific key in a distributed index) can be characterized as a function of one of these cost
3831units. Generally speaking, the analysis of a distributed structure in terms of ‘number of
3832messages’ is over-simplistic. It ignores the impact of many aspects, including the network
3833topology, the network load, and its variation, the possible heterogeneity of the software and
3834hardware components involved in data processing and routing, etc. However, it is quite
3835difficult to develop a precise cost model that would accurately take into account all these
3836performance factors; therefore, we have to live with rough but robust estimates of the system
3837behavior.
3838Serviceability or Manageability
3839Another important consideration while designing a distributed system is how easy it is to
3840operate and maintain. Serviceability or manageability is the simplicity and speed with which a
3841system can be repaired or maintained; if the time to fix a failed system increases, then
3842availability will decrease. Things to consider for manageability are the ease of diagnosing and
3843understanding problems when they occur, ease of making updates or modifications, and how
3844simple the system is to operate (i.e., does it routinely operate without failure or exceptions?).
3845Early detection of faults can decrease or avoid system downtime. For example, some
3846enterprise systems can automatically call a service center (without human intervention) when
3847the system experiences a system fault.
3849Load Balancing
3850Load Balancer (LB) is another critical component of any distributed system. It helps to spread
3851the traffic across a cluster of servers to improve responsiveness and availability of
3852applications, websites or databases. LB also keeps track of the status of all the resources while
distributing requests. If a server is not available to take new requests, is not responding, or has an elevated error rate, the LB will stop sending traffic to such a server.
3855Typically a load balancer sits between the client and the server accepting incoming network
3856and application traffic and distributing the traffic across multiple backend servers using
3857various algorithms. By balancing application requests across multiple servers, a load balancer
3858reduces individual server load and prevents any one application server from becoming a
3859single point of failure, thus improving overall application availability and responsiveness.
3860To utilize full scalability and redundancy, we can try to balance the load at each layer of the
3861system. We can add LBs at three places:
3862• Between the user and the web server
3863• Between web servers and an internal platform layer, like application servers or cache
3864servers
3865• Between internal platform layer and database.
3866Benefits of Load Balancing
3867• Users experience faster, uninterrupted service. Users won’t have to wait for a single
3868struggling server to finish its previous tasks. Instead, their requests are immediately
3869passed on to a more readily available resource.
3870• Service providers experience less downtime and higher throughput. Even a full server
3871failure won’t affect the end user experience as the load balancer will simply route
3872around it to a healthy server.
3873• Load balancing makes it easier for system administrators to handle incoming requests
3874while decreasing wait time for users.
3875• Smart load balancers provide benefits like predictive analytics that determine traffic
3876bottlenecks before they happen. As a result, the smart load balancer gives an
3878organization actionable insights. These are key to automation and can help drive
3879business decisions.
3880• System administrators experience fewer failed or stressed components. Instead of a
3881single device performing a lot of work, load balancing has several devices perform a
3882little bit of work.
3883Load Balancing Algorithms
3884How does the load balancer choose the backend server?
3885Load balancers consider two factors before forwarding a request to a backend server. They
3886will first ensure that the server they choose is actually responding appropriately to requests
3887and then use a pre-configured algorithm to select one from the set of healthy servers. We will
3888discuss these algorithms shortly.
3889Health Checks - Load balancers should only forward traffic to “healthy” backend servers. To
3890monitor the health of a backend server, “health checks” regularly attempt to connect to
3891backend servers to ensure that servers are listening. If a server fails a health check, it is
3892automatically removed from the pool, and traffic will not be forwarded to it until it responds
3893to the health checks again.
There are a variety of load balancing methods, which use different algorithms for different needs (a short round-robin sketch follows this list):
3896• Least Connection Method — This method directs traffic to the server with the fewest
3897active connections. This approach is quite useful when there are a large number of
3898persistent client connections which are unevenly distributed between the servers.
3899• Least Response Time Method — This algorithm directs traffic to the server with the
3900fewest active connections and the lowest average response time.
3901• Least Bandwidth Method - This method selects the server that is currently serving the
3902least amount of traffic measured in megabits per second (Mbps).
3903• Round Robin Method — This method cycles through a list of servers and sends each
3904new request to the next server. When it reaches the end of the list, it starts over at the
3905beginning. It is most useful when the servers are of equal specification and there are not
3906many persistent connections.
3907• Weighted Round Robin Method — The weighted round-robin scheduling is designed to
3908better handle servers with different processing capacities. Each server is assigned a
3909weight (an integer value that indicates the processing capacity). Servers with higher
weights receive new connections before those with lower weights, and servers with higher weights get more connections than those with lower weights.
3912• IP Hash — Under this method, a hash of the IP address of the client is calculated to
3913redirect the request to a server.
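As a concrete illustration of the Round Robin and Weighted Round Robin methods described above, here is a small sketch in which a server with weight N simply appears N times in the rotation; the class name, the weight expansion, and the atomic counter are assumptions for this example only.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinBalancer {
    static class Server {
        final String address;
        final int weight;   // weight 1 for every server gives plain Round Robin
        Server(String address, int weight) { this.address = address; this.weight = weight; }
    }

    private final List<String> rotation = new ArrayList<>();
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinBalancer(List<Server> servers) {
        // Weighted Round Robin: a server with weight 3 appears three times in the rotation.
        for (Server s : servers)
            for (int i = 0; i < s.weight; i++) rotation.add(s.address);
    }

    // Pick the next backend; wraps around when the end of the rotation is reached.
    String choose() {
        return rotation.get(Math.floorMod(next.getAndIncrement(), rotation.size()));
    }
}

In practice, the rotation would only contain servers that are currently passing their health checks.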
3914Redundant Load Balancers
3915The load balancer can be a single point of failure; to overcome this, a second load balancer can
3916be connected to the first to form a cluster. Each LB monitors the health of the other and, since
3918both of them are equally capable of serving traffic and failure detection, in the event the main
3919load balancer fails, the second load balancer takes over.
The following links have some good discussion about load balancers:
3921[1] What is load balancing
3922[2] Introduction to architecting systems
3923[3] Load balancing
3925Caching
3926Load balancing helps you scale horizontally across an ever-increasing number of servers, but
3927caching will enable you to make vastly better use of the resources you already have as well as
3928making otherwise unattainable product requirements feasible. Caches take advantage of the
3929locality of reference principle: recently requested data is likely to be requested again. They are
3930used in almost every layer of computing: hardware, operating systems, web browsers, web
3931applications, and more. A cache is like short-term memory: it has a limited amount of space,
3932but is typically faster than the original data source and contains the most recently accessed
3933items. Caches can exist at all levels in architecture, but are often found at the level nearest to
3934the front end where they are implemented to return data quickly without taxing downstream
3935levels.
3936Application server cache
3937Placing a cache directly on a request layer node enables the local storage of response data.
3938Each time a request is made to the service, the node will quickly return local cached data if it
3939exists. If it is not in the cache, the requesting node will query the data from disk. The cache on
3940one request layer node could also be located both in memory (which is very fast) and on the
3941node’s local disk (faster than going to network storage).
3942What happens when you expand this to many nodes? If the request layer is expanded to
3943multiple nodes, it’s still quite possible to have each node host its own cache. However, if your
3944load balancer randomly distributes requests across the nodes, the same request will go to
3945different nodes, thus increasing cache misses. Two choices for overcoming this hurdle are
3946global caches and distributed caches.
3947Content Distribution Network (CDN)
3948CDNs are a kind of cache that comes into play for sites serving large amounts of static media.
3949In a typical CDN setup, a request will first ask the CDN for a piece of static media; the CDN
3950will serve that content if it has it locally available. If it isn’t available, the CDN will query the
3951back-end servers for the file, cache it locally, and serve it to the requesting user.
If the system we are building isn’t yet large enough to have its own CDN, we can ease a future transition by serving the static media off a separate subdomain (e.g., static.yourservice.com) using a lightweight HTTP server like Nginx, and later cut the DNS over from our servers to a CDN.
3956Cache Invalidation
While caching is fantastic, it does require some maintenance to keep the cache coherent with the source of truth (e.g., the database). If the data is modified in the database, it should be invalidated in the cache; otherwise, this can cause inconsistent application behavior.
3961Solving this problem is known as cache invalidation; there are three main schemes that are
3962used:
3963Write-through cache: Under this scheme, data is written into the cache and the
3964corresponding database at the same time. The cached data allows for fast retrieval and, since
3965the same data gets written in the permanent storage, we will have complete data consistency
3966between the cache and the storage. Also, this scheme ensures that nothing will get lost in case
3967of a crash, power failure, or other system disruptions.
Although write-through minimizes the risk of data loss, it has the disadvantage of higher latency for write operations, since every write must be done twice before success is returned to the client.
3971Write-around cache: This technique is similar to write through cache, but data is written
3972directly to permanent storage, bypassing the cache. This can reduce the cache being flooded
with write operations that will not subsequently be re-read, but it has the disadvantage that a read request for recently written data will result in a “cache miss” and must be served from slower back-end storage, incurring higher latency.
3976Write-back cache: Under this scheme, data is written to cache alone and completion is
3977immediately confirmed to the client. The write to the permanent storage is done after
3978specified intervals or under certain conditions. This results in low latency and high
3979throughput for write-intensive applications, however, this speed comes with the risk of data
3980loss in case of a crash or other adverse event because the only copy of the written data is in the
3981cache.
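As a rough illustration of the three schemes above, here is a minimal sketch (hypothetical classes, not from the original text); the 'cache' and 'db' objects stand in for any in-memory cache and permanent store.

class WriteThroughStore:
    def __init__(self, cache, db):
        self.cache, self.db = cache, db

    def write(self, key, value):
        self.cache[key] = value     # write to the cache...
        self.db[key] = value        # ...and to permanent storage before acknowledging
        return "ok"

class WriteAroundStore:
    def __init__(self, cache, db):
        self.cache, self.db = cache, db

    def write(self, key, value):
        self.db[key] = value        # bypass the cache entirely
        return "ok"

class WriteBackStore:
    def __init__(self, cache, db):
        self.cache, self.db = cache, db
        self.dirty = set()          # keys written to the cache but not yet flushed

    def write(self, key, value):
        self.cache[key] = value     # acknowledge immediately; data only lives in the cache
        self.dirty.add(key)
        return "ok"

    def flush(self):                # called periodically or under certain conditions
        for key in self.dirty:
            self.db[key] = self.cache[key]
        self.dirty.clear()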
3982Cache eviction policies
3983Following are some of the most common cache eviction policies:
39841. First In First Out (FIFO): The cache evicts the first block accessed first without any
3985regard to how often or how many times it was accessed before.
2. Last In First Out (LIFO): The cache evicts the block added most recently, without any regard to how often or how many times it was accessed before.
3. Least Recently Used (LRU): Discards the least recently used items first (a minimal sketch of this policy follows the list).
39894. Most Recently Used (MRU): Discards, in contrast to LRU, the most recently used items
3990first.
39915. Least Frequently Used (LFU): Counts how often an item is needed. Those that are used
3992least often are discarded first.
39936. Random Replacement (RR): Randomly selects a candidate item and discards it to make
3994space when necessary.
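Since LRU is the policy most commonly implemented, here is a minimal sketch (hypothetical code, not from the original text) of an LRU cache built on Python's OrderedDict.

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()          # keeps keys in access order

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)         # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used item

cache = LRUCache(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")                              # "a" becomes most recently used
cache.put("c", 3)                           # evicts "b"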
The following links have some good discussions about caching:
3996[1] Cache
3997[2] Introduction to architecting systems
3999Data Partitioning
4000Data partitioning is a technique to break up a big database (DB) into many smaller parts. It is
4001the process of splitting up a DB/table across multiple machines to improve the manageability,
4002performance, availability, and load balancing of an application. The justification for data
4003partitioning is that, after a certain scale point, it is cheaper and more feasible to scale
4004horizontally by adding more machines than to grow it vertically by adding beefier servers.
40051. Partitioning Methods
4006There are many different schemes one could use to decide how to break up an application
4007database into multiple smaller DBs. Below are three of the most popular schemes used by
4008various large scale applications.
4009a. Horizontal partitioning: In this scheme, we put different rows into different tables. For
4010example, if we are storing different places in a table, we can decide that locations with ZIP
4011codes less than 10000 are stored in one table and places with ZIP codes greater than 10000
are stored in a separate table. This is also called range-based partitioning, as we are storing different ranges of data in separate tables. Horizontal partitioning is also referred to as Data Sharding.
4015The key problem with this approach is that if the value whose range is used for partitioning
4016isn’t chosen carefully, then the partitioning scheme will lead to unbalanced servers. In the
4017previous example, splitting location based on their zip codes assumes that places will be
evenly distributed across the different ZIP codes. This assumption is not valid, as there will be far more places in a densely populated area like Manhattan than in its suburbs.
4020b. Vertical Partitioning: In this scheme, we divide our data to store tables related to a specific
feature in their own server. For example, if we are building an Instagram-like application - where we need to store data related to users, the photos they upload, and the people they follow - we can
4023decide to place user profile information on one DB server, friend lists on another, and photos
4024on a third server.
4025Vertical partitioning is straightforward to implement and has a low impact on the application.
4026The main problem with this approach is that if our application experiences additional growth,
4027then it may be necessary to further partition a feature specific DB across various servers (e.g.
4028it would not be possible for a single server to handle all the metadata queries for 10 billion
4029photos by 140 million users).
4030c. Directory Based Partitioning: A loosely coupled approach to work around issues
4031mentioned in the above schemes is to create a lookup service which knows your current
4032partitioning scheme and abstracts it away from the DB access code. So, to find out where a
4033particular data entity resides, we query the directory server that holds the mapping between
4034each tuple key to its DB server. This loosely coupled approach means we can perform tasks
4035like adding servers to the DB pool or changing our partitioning scheme without having an
4036impact on the application.
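A minimal sketch of the idea (hypothetical code and server names, not from the original text): the application asks a directory for the server that owns a key instead of computing it itself.

class PartitionDirectory:
    def __init__(self):
        self._mapping = {}                  # tuple key -> DB server name

    def assign(self, key, server):
        self._mapping[key] = server         # record which server owns this key

    def lookup(self, key):
        return self._mapping[key]           # the app queries this before any DB access

directory = PartitionDirectory()
directory.assign("user:42", "db-server-3")
print(directory.lookup("user:42"))          # "db-server-3"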
40382. Partitioning Criteria
4039a. Key or Hash-based partitioning: Under this scheme, we apply a hash function to some key
4040attributes of the entity we are storing; that yields the partition number. For example, if we
4041have 100 DB servers and our ID is a numeric value that gets incremented by one each time a
4042new record is inserted. In this example, the hash function could be ‘ID % 100’, which will give
4043us the server number where we can store/read that record. This approach should ensure a
4044uniform allocation of data among servers. The fundamental problem with this approach is
4045that it effectively fixes the total number of DB servers, since adding new servers
4046means changing the hash function which would require redistribution of data and downtime
4047for the service. A workaround for this problem is to use Consistent Hashing.
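A quick sketch of this scheme and of its main drawback (hypothetical numbers; ‘ID % 100’ is the example hash function from the text):

NUM_SERVERS = 100

def server_for(record_id, num_servers=NUM_SERVERS):
    return record_id % num_servers          # e.g. ID 1234 -> server 34

print(server_for(1234))                     # 34
print(server_for(1234, num_servers=101))    # 22 - almost every key maps to a new
                                            # server when the server count changes,
                                            # which is why consistent hashing is preferred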
4048b. List partitioning: In this scheme, each partition is assigned a list of values, so whenever we
4049want to insert a new record, we will see which partition contains our key and then store it
4050there. For example, we can decide all users living in Iceland, Norway, Sweden, Finland, or
4051Denmark will be stored in a partition for the Nordic countries.
4052c. Round-robin partitioning: This is a very simple strategy that ensures uniform data
distribution. With ‘n’ partitions, the ‘i’th tuple is assigned to partition (i mod n).
4054d. Composite partitioning: Under this scheme, we combine any of the above partitioning
4055schemes to devise a new scheme. For example, first applying a list partitioning scheme and
4056then a hash based partitioning. Consistent hashing could be considered a composite of hash
4057and list partitioning where the hash reduces the key space to a size that can be listed.
40583. Common Problems of Data Partitioning
4059On a partitioned database, there are certain extra constraints on the different operations that
4060can be performed. Most of these constraints are due to the fact that operations across multiple
4061tables or multiple rows in the same table will no longer run on the same server. Below are
4062some of the constraints and additional complexities introduced by partitioning:
4063a. Joins and Denormalization: Performing joins on a database which is running on one server
4064is straightforward, but once a database is partitioned and spread across multiple machines it
4065is often not feasible to perform joins that span database partitions. Such joins will not be
4066performance efficient since data has to be compiled from multiple servers. A common
4067workaround for this problem is to denormalize the database so that queries that previously
4068required joins can be performed from a single table. Of course, the service now has to deal
4069with all the perils of denormalization such as data inconsistency.
b. Referential integrity: Just as performing a cross-partition query on a partitioned database is not feasible, trying to enforce data integrity constraints such as foreign keys in a partitioned database can be extremely difficult.
Most RDBMS do not support foreign key constraints across databases on different database servers, which means that applications that require referential integrity on partitioned databases often have to enforce it in application code. Often in such cases, applications have to run regular SQL jobs to clean up dangling references.
4078c. Rebalancing: There could be many reasons we have to change our partitioning scheme:
40791. The data distribution is not uniform, e.g., there are a lot of places for a particular ZIP
4080code that cannot fit into one database partition.
40812. There is a lot of load on a partition, e.g., there are too many requests being handled by
4082the DB partition dedicated to user photos.
In such cases, either we have to create more DB partitions or we have to rebalance existing partitions, which means the partitioning scheme changes and all existing data has to be moved to new locations. Doing this without incurring downtime is extremely difficult. Using a scheme like directory-based partitioning does make rebalancing more palatable, at the cost of increasing the complexity of the system and creating a new single point of failure (i.e., the lookup service/database).
4090Indexes
4091Indexes are well known when it comes to databases. Sooner or later there comes a time when
4092database performance is no longer satisfactory. One of the very first things you should turn to
4093when that happens is database indexing.
4094The goal of creating an index on a particular table in a database is to make it faster to search
4095through the table and find the row or rows that we want. Indexes can be created using one or
4096more columns of a database table, providing the basis for both rapid random lookups and
4097efficient access of ordered records.
4098Example: A library catalog
4099A library catalog is a register that contains the list of books found in a library. The catalog is
4100organized like a database table generally with four columns: book title, writer, subject, and
4101date of publication. There are usually two such catalogs: one sorted by the book title and one
4102sorted by the writer name. That way, you can either think of a writer you want to read and
4103then look through their books or look up a specific book title you know you want to read in
4104case you don’t know the writer’s name. These catalogs are like indexes for the database of
4105books. They provide a sorted list of data that is easily searchable by relevant information.
Simply put, an index is a data structure that can be thought of as a table of contents that points us to the location where the actual data lives. So when we create an index on a column of a table, we store that column and a pointer to the whole row in the index. For a table containing a list of books, an index on the ‘Title’ column is essentially a sorted list of titles, where each entry points back to the full row for that book.
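As a small, concrete illustration (using Python's built-in sqlite3 module; the table and column names are made up to match the book example above), creating an index on ‘Title’ and using it in a lookup might look like this:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, writer TEXT, subject TEXT, published TEXT)")
conn.execute("CREATE INDEX idx_books_title ON books (title)")   # index on the 'Title' column

conn.execute("INSERT INTO books VALUES ('Dune', 'Frank Herbert', 'Sci-Fi', '1965')")

# This lookup can use idx_books_title instead of scanning the whole table.
row = conn.execute("SELECT * FROM books WHERE title = 'Dune'").fetchone()
print(row)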
4111Just like a traditional relational data store, we can also apply this concept to larger datasets.
4112The trick with indexes is that we must carefully consider how users will access the data. In the
4113case of data sets that are many terabytes in size, but have very small payloads (e.g., 1 KB),
4114indexes are a necessity for optimizing data access. Finding a small payload in such a large
4115dataset can be a real challenge, since we can’t possibly iterate over that much data in any
4116reasonable time. Furthermore, it is very likely that such a large data set is spread over several
4117physical devices—this means we need some way to find the correct physical location of the
4118desired data. Indexes are the best way to do this.
4119How do Indexes decrease write performance?
An index can dramatically speed up data retrieval, but the index itself may be large due to the additional keys, and it slows down data insertion and updates.
4122When adding rows or making updates to existing rows for a table with an active index, we not
4123only have to write the data but also have to update the index. This will decrease the write
4124performance. This performance degradation applies to all insert, update, and delete
4126operations for the table. For this reason, adding unnecessary indexes on tables should be
4127avoided and indexes that are no longer used should be removed. To reiterate, adding indexes
is about improving the performance of search queries. If the goal of the database is to provide a data store that is often written to and rarely read from, then decreasing the performance of the more common operation (writing) is probably not worth the increase in read performance.
4132For more details, see Database Indexes.
4134Proxies
4135A proxy server is an intermediate server between the client and the back-end server. Clients
4136connect to proxy servers to make a request for a service like a web page, file, connection, etc.
4137In short, a proxy server is a piece of software or hardware that acts as an intermediary for
4138requests from clients seeking resources from other servers.
4139Typically, proxies are used to filter requests, log requests, or sometimes transform requests
4140(by adding/removing headers, encrypting/decrypting, or compressing a resource). Another
4141advantage of a proxy server is that its cache can serve a lot of requests. If multiple clients
4142access a particular resource, the proxy server can cache it and serve it to all the clients without
4143going to the remote server.
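To make the caching behavior concrete, here is a minimal sketch (standard-library only; the upstream address and port are hypothetical) of a tiny caching proxy in front of a single back-end server: it forwards GET requests and serves repeated requests from an in-memory cache.

from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

UPSTREAM = "http://localhost:8000"          # assumed back-end server
CACHE = {}                                  # path -> response body

class CachingProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path not in CACHE:
            with urlopen(UPSTREAM + self.path) as resp:   # fetch from the origin server
                CACHE[self.path] = resp.read()            # cache it locally
        body = CACHE[self.path]
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# HTTPServer(("", 8080), CachingProxy).serve_forever()    # run the proxy on port 8080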
4144Proxy Server Types
4145Proxies can reside on the client’s local server or anywhere between the client and the remote
servers. Here are a few well-known types of proxy servers:
4147Open Proxy
4148An open proxy is a proxy server that is accessible by any Internet user. Generally, a proxy
4149server only allows users within a network group (i.e. a closed proxy) to store and forward
4150Internet services such as DNS or web pages to reduce and control the bandwidth used by the
4151group. With an open proxy, however, any user on the Internet is able to use this forwarding
service. There are two well-known open proxy types:
1. Anonymous Proxy - This proxy reveals its identity as a server but does not disclose the initial IP address. Though this proxy server can be discovered easily, it can be beneficial for some users as it hides their IP address.
2. Transparent Proxy - This proxy server again identifies itself, and with the support of HTTP headers, the first IP address can be viewed. The main benefit of using this sort of server is its ability to cache websites.
4160Reverse Proxy
4161A reverse proxy retrieves resources on behalf of a client from one or more servers. These
4162resources are then returned to the client, appearing as if they originated from the proxy server
itself.
4165Redundancy and Replication
4166Redundancy is the duplication of critical components or functions of a system with the
4167intention of increasing the reliability of the system, usually in the form of a backup or fail-safe,
4168or to improve actual system performance. For example, if there is only one copy of a file
4169stored on a single server, then losing that server means losing the file. Since losing data is
4170seldom a good thing, we can create duplicate or redundant copies of the file to solve this
4171problem.
4172Redundancy plays a key role in removing the single points of failure in the system and
4173provides backups if needed in a crisis. For example, if we have two instances of a service
4174running in production and one fails, the system can failover to the other one.
4175Replication means sharing information to ensure consistency between redundant resources,
4176such as software or hardware components, to improve reliability, fault-tolerance, or
4177accessibility.
4178Replication is widely used in many database management systems (DBMS), usually with a
4179master-slave relationship between the original and the copies. The master gets all the updates,
4180which then ripple through to the slaves. Each slave outputs a message stating that it has
4181received the update successfully, thus allowing the sending of subsequent updates.
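A minimal sketch of that master-slave flow (hypothetical classes, not from the original text): the master applies each write locally and then ripples it out to its replicas, collecting their acknowledgements.

class Replica:
    def __init__(self, name):
        self.name, self.data = name, {}

    def apply(self, key, value):
        self.data[key] = value
        return f"{self.name}: applied {key}"                  # acknowledgement message

class Master:
    def __init__(self, replicas):
        self.data, self.replicas = {}, replicas

    def write(self, key, value):
        self.data[key] = value                                # the master gets all updates...
        return [r.apply(key, value) for r in self.replicas]   # ...then ripples them to the slaves

master = Master([Replica("slave-1"), Replica("slave-2")])
print(master.write("user:1", "Alice"))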
4183SQL vs. NoSQL
4184In the world of databases, there are two main types of solutions: SQL and NoSQL (or
4185relational databases and non-relational databases). Both of them differ in the way they were
4186built, the kind of information they store, and the storage method they use.
4187Relational databases are structured and have predefined schemas like phone books that store
4188phone numbers and addresses. Non-relational databases are unstructured, distributed, and
4189have a dynamic schema like file folders that hold everything from a person’s address and
4190phone number to their Facebook ‘likes’ and online shopping preferences.
4191SQL
4192Relational databases store data in rows and columns. Each row contains all the information
4193about one entity and each column contains all the separate data points. Some of the most
4194popular relational databases are MySQL, Oracle, MS SQL Server, SQLite, Postgres, and
4195MariaDB.
4196NoSQL
4197Following are the most common types of NoSQL:
4198Key-Value Stores: Data is stored in an array of key-value pairs. The ‘key’ is an attribute name
4199which is linked to a ‘value’. Well-known key-value stores include Redis, Voldemort, and
4200Dynamo.
4201Document Databases: In these databases, data is stored in documents (instead of rows and
4202columns in a table) and these documents are grouped together in collections. Each document
can have an entirely different structure. Document databases include CouchDB and MongoDB.
4205Wide-Column Databases: Instead of ‘tables,’ in columnar databases we have column families,
4206which are containers for rows. Unlike relational databases, we don’t need to know all the
4207columns up front and each row doesn’t have to have the same number of columns. Columnar
4208databases are best suited for analyzing large datasets - big names include Cassandra and
4209HBase.
4210Graph Databases: These databases are used to store data whose relations are best represented
4211in a graph. Data is saved in graph structures with nodes (entities), properties (information
about the entities), and lines (connections between the entities). Examples of graph databases include Neo4J and InfiniteGraph.
4215High level differences between SQL and NoSQL
4216Storage: SQL stores data in tables where each row represents an entity and each column
4217represents a data point about that entity; for example, if we are storing a car entity in a table,
4218different columns could be ‘Color’, ‘Make’, ‘Model’, and so on.
4219NoSQL databases have different data storage models. The main ones are key-value,
4220document, graph, and columnar. We will discuss differences between these databases below.
4221Schema: In SQL, each record conforms to a fixed schema, meaning the columns must be
4222decided and chosen before data entry and each row must have data for each column. The
4223schema can be altered later, but it involves modifying the whole database and going offline.
4224In NoSQL, schemas are dynamic. Columns can be added on the fly and each ‘row’ (or
4225equivalent) doesn’t have to contain data for each ‘column.’
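A tiny illustration of that difference (the data below is made up): every SQL row carries the same columns, while documents in the same NoSQL collection can each have a different shape.

# Fixed schema: every record has exactly these fields (like columns in a table).
sql_rows = [
    ("1", "John", "555-1234", "New York"),
    ("2", "Ada",  "555-9876", "London"),
]

# Dynamic schema: each document can carry different fields.
nosql_documents = [
    {"id": 1, "name": "John", "phone": "555-1234"},
    {"id": 2, "name": "Ada", "city": "London", "likes": ["chess", "math"]},
]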
4226Querying: SQL databases use SQL (structured query language) for defining and manipulating
4227the data, which is very powerful. In a NoSQL database, queries are focused on a collection of
4228documents. Sometimes it is also called UnQL (Unstructured Query Language). Different
4229databases have different syntax for using UnQL.
4230Scalability: In most common situations, SQL databases are vertically scalable, i.e., by
4231increasing the horsepower (higher Memory, CPU, etc.) of the hardware, which can get very
4232expensive. It is possible to scale a relational database across multiple servers, but this is a
4233challenging and time-consuming process.
4234On the other hand, NoSQL databases are horizontally scalable, meaning we can add more
4235servers easily in our NoSQL database infrastructure to handle a lot of traffic. Any cheap
4236commodity hardware or cloud instances can host NoSQL databases, thus making it a lot more
4237cost-effective than vertical scaling. A lot of NoSQL technologies also distribute data across
4238servers automatically.
4239Reliability or ACID Compliancy (Atomicity, Consistency, Isolation, Durability): The vast
majority of relational databases are ACID compliant. So, when it comes to data reliability and the safe execution of transactions, SQL databases are still the better bet.
4242Most of the NoSQL solutions sacrifice ACID compliance for performance and scalability.
4243SQL VS. NoSQL - Which one to use?
4244When it comes to database technology, there’s no one-size-fits-all solution. That’s why many
4245businesses rely on both relational and non-relational databases for different needs. Even as
4246NoSQL databases are gaining popularity for their speed and scalability, there are still
4247situations where a highly structured SQL database may perform better; choosing the right
4248technology hinges on the use case.
4250Reasons to use SQL database
4251Here are a few reasons to choose a SQL database:
42521. We need to ensure ACID compliance. ACID compliance reduces anomalies and protects
4253the integrity of your database by prescribing exactly how transactions interact with the
4254database. Generally, NoSQL databases sacrifice ACID compliance for scalability and
processing speed, but for many e-commerce and financial applications, an ACID-compliant database remains the preferred option.
42562. Your data is structured and unchanging. If your business is not experiencing massive
4257growth that would require more servers and if you’re only working with data that is
4258consistent, then there may be no reason to use a system designed to support a variety of
4259data types and high traffic volume.
4260Reasons to use NoSQL database
4261When all the other components of our application are fast and seamless, NoSQL databases
prevent data from being the bottleneck. Big data has contributed greatly to the success of NoSQL databases, mainly because they handle data differently than traditional relational databases.
4264A few popular examples of NoSQL databases are MongoDB, CouchDB, Cassandra, and HBase.
42651. Storing large volumes of data that often have little to no structure. A NoSQL database
4266sets no limits on the types of data we can store together and allows us to add new types
4267as the need changes. With document-based databases, you can store data in one place
4268without having to define what “types” of data those are in advance.
42692. Making the most of cloud computing and storage. Cloud-based storage is an excellent
4270cost-saving solution but requires data to be easily spread across multiple servers to
4271scale up. Using commodity (affordable, smaller) hardware on-site or in the cloud saves
4272you the hassle of additional software and NoSQL databases like Cassandra are designed
4273to be scaled across multiple data centers out of the box, without a lot of headaches.
42743. Rapid development. NoSQL is extremely useful for rapid development as it doesn’t
4275need to be prepped ahead of time. If you’re working on quick iterations of your system
4276which require making frequent updates to the data structure without a lot of downtime
4277between versions, a relational database will slow you down.
4279CAP Theorem
4280CAP theorem states that it is impossible for a distributed software system to simultaneously
4281provide more than two out of three of the following guarantees (CAP): Consistency,
4282Availability, and Partition tolerance. When we design a distributed system, trading off among
4283CAP is almost the first thing we want to consider. CAP theorem says while designing a
4284distributed system we can pick only two of the following three options:
4285Consistency: All nodes see the same data at the same time. Consistency is achieved by
4286updating several nodes before allowing further reads.
4287Availability: Every request gets a response on success/failure. Availability is achieved by
4288replicating the data across different servers.
4289Partition tolerance: The system continues to work despite message loss or partial failure. A
4290system that is partition-tolerant can sustain any amount of network failure that doesn’t result
4291in a failure of the entire network. Data is sufficiently replicated across combinations of nodes
4292and networks to keep the system up through intermittent outages.
4293We cannot build a general data store that is continually available, sequentially consistent, and
4294tolerant to any partition failures. We can only build a system that has any two of these three
properties. This is because, to be consistent, all nodes must see the same set of updates in the same
4296order. But if the network suffers a partition, updates in one partition might not make it to the
4297other partitions before a client reads from the out-of-date partition after having read from the
4298up-to-date one. The only thing that can be done to cope with this possibility is to stop serving
4299requests from the out-of-date partition, but then the service is no longer 100% available.
4301Consistent Hashing
4302Distributed Hash Table (DHT) is one of the fundamental components used in distributed
scalable systems. Hash tables need a key, a value, and a hash function, where the hash function maps the key to a location where the value is stored.
4305index = hash_function(key)
4306Suppose we are designing a distributed caching system. Given ‘n’ cache servers, an intuitive
4307hash function would be ‘key % n’. It is simple and commonly used. But it has two major
4308drawbacks:
43091. It is NOT horizontally scalable. Whenever a new cache host is added to the system, all
4310existing mappings are broken. It will be a pain point in maintenance if the caching
4311system contains lots of data. Practically, it becomes difficult to schedule a downtime to
4312update all caching mappings.
43132. It may NOT be load balanced, especially for non-uniformly distributed data. In practice,
4314it can be easily assumed that the data will not be distributed uniformly. For the caching
system, this translates into some caches becoming hot and saturated while others sit idle and almost empty.
4317In such situations, consistent hashing is a good way to improve the caching system.
4318What is Consistent Hashing?
Consistent hashing is a very useful strategy for distributed caching systems and DHTs. It
4320allows us to distribute data across a cluster in such a way that will minimize reorganization
4321when nodes are added or removed. Hence, the caching system will be easier to scale up or
4322scale down.
4323In Consistent Hashing, when the hash table is resized (e.g. a new cache host is added to the
4324system), only ‘k/n’ keys need to be remapped where ‘k’ is the total number of keys and ‘n’ is
4325the total number of servers. Recall that in a caching system using the ‘mod’ as the hash
4326function, all keys need to be remapped.
4327In Consistent Hashing, objects are mapped to the same host if possible. When a host is
4328removed from the system, the objects on that host are shared by other hosts; when a new host
is added, it takes its share from a few hosts without touching the others’ shares.
4330How does it work?
4331As a typical hash function, consistent hashing maps a key to an integer. Suppose the output of
4332the hash function is in the range of [0, 256). Imagine that the integers in the range are placed
4333on a ring such that the values are wrapped around.
4334Here’s how consistent hashing works:
4335153
43361. Given a list of cache servers, hash them to integers in the range.
43382. To map a key to a server,
4339o Hash it to a single integer.
4340o Move clockwise on the ring until finding the first cache it encounters.
o That cache is the one that contains the key. For example, key1 maps to cache A and key2 maps to cache C.
4343To add a new server, say D, keys that were originally residing at C will be split. Some of them
4344will be shifted to D, while other keys will not be touched.
4345To remove a cache or, if a cache fails, say A, all keys that were originally mapped to A will fall
4346into B, and only those keys need to be moved to B; other keys will not be affected.
Regarding load balancing, as we discussed in the beginning, real data is essentially randomly distributed and thus may not be uniform, which can leave the keys on the caches unbalanced.
4349To handle this issue, we add “virtual replicas” for caches. Instead of mapping each cache to a
4350single point on the ring, we map it to multiple points on the ring, i.e. replicas. This way, each
4351cache is associated with multiple portions of the ring.
4352If the hash function “mixes well,” as the number of replicas increases, the keys will be more
4353balanced.
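Putting the pieces together, here is a minimal sketch (hypothetical code, not from the original text) of a consistent hash ring with virtual replicas; MD5 stands in for the hash function and the cache names are made up.

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, replicas=100):
        self.replicas = replicas
        self._keys = []                       # sorted hash positions on the ring
        self._ring = {}                       # hash position -> server
        for server in servers:
            self.add(server)

    def _hash(self, value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add(self, server):
        for i in range(self.replicas):        # place several virtual points per server
            h = self._hash(f"{server}#{i}")
            bisect.insort(self._keys, h)
            self._ring[h] = server

    def remove(self, server):
        for i in range(self.replicas):
            h = self._hash(f"{server}#{i}")
            self._keys.remove(h)
            del self._ring[h]

    def get(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._keys, h) % len(self._keys)   # move clockwise on the ring
        return self._ring[self._keys[idx]]

ring = ConsistentHashRing(["cacheA", "cacheB", "cacheC"])
print(ring.get("key1"), ring.get("key2"))
ring.add("cacheD")                            # only a fraction of keys move to cacheD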
4355Long-Polling vs WebSockets vs Server-Sent Events
4356What is the difference between Long-Polling, WebSockets, and Server-Sent Events?
4357Long-Polling, WebSockets, and Server-Sent Events are popular communication protocols
4358between a client like a web browser and a web server. First, let’s start with understanding
what a standard HTTP web request looks like. The following is the sequence of events for a regular HTTP request:
43611. The client opens a connection and requests data from the server.
43622. The server calculates the response.
43633. The server sends the response back to the client on the opened request.
4364Ajax Polling
4365Polling is a standard technique used by the vast majority of AJAX applications. The basic idea
4366is that the client repeatedly polls (or requests) a server for data. The client makes a request
4367and waits for the server to respond with data. If no data is available, an empty response is
4368returned.
43691. The client opens a connection and requests data from the server using regular HTTP.
43702. The requested webpage sends requests to the server at regular intervals (e.g., 0.5
4371seconds).
43723. The server calculates the response and sends it back, just like regular HTTP traffic.
43734. The client repeats the above three steps periodically to get updates from the server.
4374The problem with Polling is that the client has to keep asking the server for any new data. As a
4375result, a lot of responses are empty, creating HTTP overhead.
4377HTTP Long-Polling
4378This is a variation of the traditional polling technique that allows the server to push
4379information to a client whenever the data is available. With Long-Polling, the client requests
4380information from the server exactly as in normal polling, but with the expectation that the
4381server may not respond immediately. That’s why this technique is sometimes referred to as a
4382“Hanging GET”.
4383• If the server does not have any data available for the client, instead of sending an empty
4384response, the server holds the request and waits until some data becomes available.
• Once the data becomes available, a full response is sent to the client. The client then immediately re-requests information from the server so that the server will almost
4387always have an available waiting request that it can use to deliver data in response to an
4388event.
The basic life cycle of an application using HTTP Long-Polling is as follows (a minimal client-side sketch appears after these steps):
43901. The client makes an initial request using regular HTTP and then waits for a response.
43912. The server delays its response until an update is available or a timeout has occurred.
43923. When an update is available, the server sends a full response to the client.
43934. The client typically sends a new long-poll request, either immediately upon receiving a
4394response or after a pause to allow an acceptable latency period.
43955. Each Long-Poll request has a timeout. The client has to reconnect periodically after the
4396connection is closed due to timeouts.
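Here is that client-side loop as a minimal sketch (assumptions: the third-party 'requests' package is available, and '/updates' is a hypothetical endpoint that holds the request open until data arrives or a timeout occurs).

import requests

UPDATES_URL = "https://example.com/updates"    # hypothetical long-poll endpoint

def handle_update(data):
    print("received:", data)                   # process the pushed data

def long_poll_forever():
    while True:
        try:
            # The server holds this request open until it has data to send.
            resp = requests.get(UPDATES_URL, timeout=30)
            if resp.status_code == 200:
                handle_update(resp.json())
        except requests.exceptions.Timeout:
            pass                               # no data this round; fall through and reconnect
        # Immediately re-issue the request so the server always has one waiting.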
4398WebSockets
WebSocket provides full-duplex communication channels over a single TCP connection. It
4400provides a persistent connection between a client and a server that both parties can use to
4401start sending data at any time. The client establishes a WebSocket connection through a
4402process known as the WebSocket handshake. If the process succeeds, then the server and
4403client can exchange data in both directions at any time. The WebSocket protocol enables
4404communication between a client and a server with lower overheads, facilitating real-time data
4405transfer from and to the server. This is made possible by providing a standardized way for the
server to send content to the browser without being asked by the client and by allowing messages to be passed back and forth while keeping the connection open. In this way, a two-way (bi-directional) ongoing conversation can take place between a client and a server.
4408Server-Sent Events (SSEs)
4409Under SSEs the client establishes a persistent and long-term connection with the server. The
4410server uses this connection to send data to a client. If the client wants to send data to the
4411server, it would require the use of another technology/protocol to do so.
44121. Client requests data from a server using regular HTTP.
44142. The requested webpage opens a connection to the server.
44153. The server sends the data to the client whenever there’s new information available.
4416SSEs are best when we need real-time traffic from the server to the client or if the server is
4417generating data in a loop and will be sending multiple events to the client.