ZzKWeYEV

· 6 years ago · Feb 10, 2020, 02:58 AM
1Skip to content
2Why GitHub? 
3Enterprise
4Explore 
5Marketplace
6Pricing 
7Search
8
9Sign in
10Sign up
11donnemartin
12/
13system-design-primer
144.3k82.4k13.8k
15 Code Issues 78 Pull requests 58 Actions Projects 0 Security Insights
16system-design-primer/solutions/system_design/pastebin/
17@kevinxuv
18@donnemartin
19kevinxuv and donnemartin zh-Hans: Translate Pastebin solution (#273)
20Latest commit
2133431e6
22on Jun 16, 2019
23Type	Name	Latest commit message	Commit time
24..		
25README-zh-Hans.md	zh-Hans: Translate Pastebin solution (#273)	8 months ago
26README.md	Enable syntax highlighting in all python code snippets (#268)	9 months ago
27__init__.py	Add Pastebin solution	3 years ago
28pastebin.graffle	Add pastebin OmniGraffle diagrams	3 years ago
29pastebin.png	Add Pastebin solution	3 years ago
30pastebin.py	Fix coding errors (#149)	2 years ago
31pastebin_basic.graffle	Add pastebin OmniGraffle diagrams	3 years ago
32pastebin_basic.png	Add Pastebin solution	3 years ago
33 README.md
34Design Pastebin.com (or Bit.ly)
35Note: This document links directly to relevant areas found in the system design topics to avoid duplication. Refer to the linked content for general talking points, tradeoffs, and alternatives.
36
37Design Bit.ly - is a similar question, except pastebin requires storing the paste contents instead of the original unshortened url.
38
39Step 1: Outline use cases and constraints
40Gather requirements and scope the problem. Ask questions to clarify use cases and constraints. Discuss assumptions.
41
42Without an interviewer to address clarifying questions, we'll define some use cases and constraints.
43
44Use cases
45We'll scope the problem to handle only the following use cases
46User enters a block of text and gets a randomly generated link
47Expiration
48Default setting does not expire
49Can optionally set a timed expiration
50User enters a paste's url and views the contents
51User is anonymous
52Service tracks analytics of pages
53Monthly visit stats
54Service deletes expired pastes
55Service has high availability
56Out of scope
57User registers for an account
58User verifies email
59User logs into a registered account
60User edits the document
61User can set visibility
62User can set the shortlink
63Constraints and assumptions
64State assumptions
65Traffic is not evenly distributed
66Following a short link should be fast
67Pastes are text only
68Page view analytics do not need to be realtime
6910 million users
7010 million paste writes per month
71100 million paste reads per month
7210:1 read to write ratio
73Calculate usage
74Clarify with your interviewer if you should run back-of-the-envelope usage calculations.
75
76Size per paste
771 KB content per paste
78shortlink - 7 bytes
79expiration_length_in_minutes - 4 bytes
80created_at - 5 bytes
81paste_path - 255 bytes
82total = ~1.27 KB
8312.7 GB of new paste content per month
841.27 KB per paste * 10 million pastes per month
85~450 GB of new paste content in 3 years
86360 million shortlinks in 3 years
87Assume most are new pastes instead of updates to existing ones
884 paste writes per second on average
8940 read requests per second on average
90Handy conversion guide:
91
922.5 million seconds per month
931 request per second = 2.5 million requests per month
9440 requests per second = 100 million requests per month
95400 requests per second = 1 billion requests per month
96Step 2: Create a high level design
97Outline a high level design with all important components.
98
99Imgur
100
101Step 3: Design core components
102Dive into details for each core component.
103
104Use case: User enters a block of text and gets a randomly generated link
105We could use a relational database as a large hash table, mapping the generated url to a file server and path containing the paste file.
106
107Instead of managing a file server, we could use a managed Object Store such as Amazon S3 or a NoSQL document store.
108
109An alternative to a relational database acting as a large hash table, we could use a NoSQL key-value store. We should discuss the tradeoffs between choosing SQL or NoSQL. The following discussion uses the relational database approach.
110
111The Client sends a create paste request to the Web Server, running as a reverse proxy
112The Web Server forwards the request to the Write API server
113The Write API server does the following:
114Generates a unique url
115Checks if the url is unique by looking at the SQL Database for a duplicate
116If the url is not unique, it generates another url
117If we supported a custom url, we could use the user-supplied (also check for a duplicate)
118Saves to the SQL Database pastes table
119Saves the paste data to the Object Store
120Returns the url
121Clarify with your interviewer how much code you are expected to write.
122
123The pastes table could have the following structure:
124
125shortlink char(7) NOT NULL
126expiration_length_in_minutes int NOT NULL
127created_at datetime NOT NULL
128paste_path varchar(255) NOT NULL
129PRIMARY KEY(shortlink)
130We'll create an index on shortlink and created_at to speed up lookups (log-time instead of scanning the entire table) and to keep the data in memory. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.1
131
132To generate the unique url, we could:
133
134Take the MD5 hash of the user's ip_address + timestamp
135MD5 is a widely used hashing function that produces a 128-bit hash value
136MD5 is uniformly distributed
137Alternatively, we could also take the MD5 hash of randomly-generated data
138Base 62 encode the MD5 hash
139Base 62 encodes to [a-zA-Z0-9] which works well for urls, eliminating the need for escaping special characters
140There is only one hash result for the original input and Base 62 is deterministic (no randomness involved)
141Base 64 is another popular encoding but provides issues for urls because of the additional + and / characters
142The following Base 62 pseudocode runs in O(k) time where k is the number of digits = 7:
143def base_encode(num, base=62):
144    digits = []
145    while num > 0
146      remainder = modulo(num, base)
147      digits.push(remainder)
148      num = divide(num, base)
149    digits = digits.reverse
150Take the first 7 characters of the output, which results in 62^7 possible values and should be sufficient to handle our constraint of 360 million shortlinks in 3 years:
151url = base_encode(md5(ip_address+timestamp))[:URL_LENGTH]
152We'll use a public REST API:
153
154$ curl -X POST --data '{ "expiration_length_in_minutes": "60", \
155    "paste_contents": "Hello World!" }' https://pastebin.com/api/v1/paste
156Response:
157
158{
159    "shortlink": "foobar"
160}
161For internal communications, we could use Remote Procedure Calls.
162
163Use case: User enters a paste's url and views the contents
164The Client sends a get paste request to the Web Server
165The Web Server forwards the request to the Read API server
166The Read API server does the following:
167Checks the SQL Database for the generated url
168If the url is in the SQL Database, fetch the paste contents from the Object Store
169Else, return an error message for the user
170REST API:
171
172$ curl https://pastebin.com/api/v1/paste?shortlink=foobar
173Response:
174
175{
176    "paste_contents": "Hello World"
177    "created_at": "YYYY-MM-DD HH:MM:SS"
178    "expiration_length_in_minutes": "60"
179}
180Use case: Service tracks analytics of pages
181Since realtime analytics are not a requirement, we could simply MapReduce the Web Server logs to generate hit counts.
182
183Clarify with your interviewer how much code you are expected to write.
184
185class HitCounts(MRJob):
186
187    def extract_url(self, line):
188        """Extract the generated url from the log line."""
189        ...
190
191    def extract_year_month(self, line):
192        """Return the year and month portions of the timestamp."""
193        ...
194
195    def mapper(self, _, line):
196        """Parse each log line, extract and transform relevant lines.
197
198        Emit key value pairs of the form:
199
200        (2016-01, url0), 1
201        (2016-01, url0), 1
202        (2016-01, url1), 1
203        """
204        url = self.extract_url(line)
205        period = self.extract_year_month(line)
206        yield (period, url), 1
207
208    def reducer(self, key, values):
209        """Sum values for each key.
210
211        (2016-01, url0), 2
212        (2016-01, url1), 1
213        """
214        yield key, sum(values)
215Use case: Service deletes expired pastes
216To delete expired pastes, we could just scan the SQL Database for all entries whose expiration timestamp are older than the current timestamp. All expired entries would then be deleted (or marked as expired) from the table.
217
218Step 4: Scale the design
219Identify and address bottlenecks, given the constraints.
220
221Imgur
222
223Important: Do not simply jump right into the final design from the initial design!
224
225State you would do this iteratively: 1) Benchmark/Load Test, 2) Profile for bottlenecks 3) address bottlenecks while evaluating alternatives and trade-offs, and 4) repeat. See Design a system that scales to millions of users on AWS as a sample on how to iteratively scale the initial design.
226
227It's important to discuss what bottlenecks you might encounter with the initial design and how you might address each of them. For example, what issues are addressed by adding a Load Balancer with multiple Web Servers? CDN? Master-Slave Replicas? What are the alternatives and Trade-Offs for each?
228
229We'll introduce some components to complete the design and to address scalability issues. Internal load balancers are not shown to reduce clutter.
230
231To avoid repeating discussions, refer to the following system design topics for main talking points, tradeoffs, and alternatives:
232
233DNS
234CDN
235Load balancer
236Horizontal scaling
237Web server (reverse proxy)
238API server (application layer)
239Cache
240Relational database management system (RDBMS)
241SQL write master-slave failover
242Master-slave replication
243Consistency patterns
244Availability patterns
245The Analytics Database could use a data warehousing solution such as Amazon Redshift or Google BigQuery.
246
247An Object Store such as Amazon S3 can comfortably handle the constraint of 12.7 GB of new content per month.
248
249To address the 40 average read requests per second (higher at peak), traffic for popular content should be handled by the Memory Cache instead of the database. The Memory Cache is also useful for handling the unevenly distributed traffic and traffic spikes. The SQL Read Replicas should be able to handle the cache misses, as long as the replicas are not bogged down with replicating writes.
250
2514 average paste writes per second (with higher at peak) should be do-able for a single SQL Write Master-Slave. Otherwise, we'll need to employ additional SQL scaling patterns:
252
253Federation
254Sharding
255Denormalization
256SQL Tuning
257We should also consider moving some data to a NoSQL Database.
258
259Additional talking points
260Additional topics to dive into, depending on the problem scope and time remaining.
261
262NoSQL
263Key-value store
264Document store
265Wide column store
266Graph database
267SQL vs NoSQL
268Caching
269Where to cache
270Client caching
271CDN caching
272Web server caching
273Database caching
274Application caching
275What to cache
276Caching at the database query level
277Caching at the object level
278When to update the cache
279Cache-aside
280Write-through
281Write-behind (write-back)
282Refresh ahead
283Asynchronism and microservices
284Message queues
285Task queues
286Back pressure
287Microservices
288Communications
289Discuss tradeoffs:
290External communication with clients - HTTP APIs following REST
291Internal communications - RPC
292Service discovery
293Security
294Refer to the security section.
295
296Latency numbers
297See Latency numbers every programmer should know.
298
299Ongoing
300Continue benchmarking and monitoring your system to address bottlenecks as they come up
301Scaling is an iterative process
302© 2020 GitHub, Inc.
303Terms
304Privacy
305Security
306Status
307Help
308Contact GitHub
309Pricing
310API
311Training
312Blog
313About
314Skip to content
315Why GitHub? 
316Enterprise
317Explore 
318Marketplace
319Pricing 
320Search
321
322Sign in
323Sign up
324donnemartin
325/
326system-design-primer
3274.3k82.4k13.8k
328 Code Issues 78 Pull requests 58 Actions Projects 0 Security Insights
329system-design-primer/solutions/system_design/pastebin/
330@kevinxuv
331@donnemartin
332kevinxuv and donnemartin zh-Hans: Translate Pastebin solution (#273)
333Latest commit
33433431e6
335on Jun 16, 2019
336Type	Name	Latest commit message	Commit time
337..		
338README-zh-Hans.md	zh-Hans: Translate Pastebin solution (#273)	8 months ago
339README.md	Enable syntax highlighting in all python code snippets (#268)	9 months ago
340__init__.py	Add Pastebin solution	3 years ago
341pastebin.graffle	Add pastebin OmniGraffle diagrams	3 years ago
342pastebin.png	Add Pastebin solution	3 years ago
343pastebin.py	Fix coding errors (#149)	2 years ago
344pastebin_basic.graffle	Add pastebin OmniGraffle diagrams	3 years ago
345pastebin_basic.png	Add Pastebin solution	3 years ago
346 README.md
347Design Pastebin.com (or Bit.ly)
348Note: This document links directly to relevant areas found in the system design topics to avoid duplication. Refer to the linked content for general talking points, tradeoffs, and alternatives.
349
350Design Bit.ly - is a similar question, except pastebin requires storing the paste contents instead of the original unshortened url.
351
352Step 1: Outline use cases and constraints
353Gather requirements and scope the problem. Ask questions to clarify use cases and constraints. Discuss assumptions.
354
355Without an interviewer to address clarifying questions, we'll define some use cases and constraints.
356
357Use cases
358We'll scope the problem to handle only the following use cases
359User enters a block of text and gets a randomly generated link
360Expiration
361Default setting does not expire
362Can optionally set a timed expiration
363User enters a paste's url and views the contents
364User is anonymous
365Service tracks analytics of pages
366Monthly visit stats
367Service deletes expired pastes
368Service has high availability
369Out of scope
370User registers for an account
371User verifies email
372User logs into a registered account
373User edits the document
374User can set visibility
375User can set the shortlink
376Constraints and assumptions
377State assumptions
378Traffic is not evenly distributed
379Following a short link should be fast
380Pastes are text only
381Page view analytics do not need to be realtime
38210 million users
38310 million paste writes per month
384100 million paste reads per month
38510:1 read to write ratio
386Calculate usage
387Clarify with your interviewer if you should run back-of-the-envelope usage calculations.
388
389Size per paste
3901 KB content per paste
391shortlink - 7 bytes
392expiration_length_in_minutes - 4 bytes
393created_at - 5 bytes
394paste_path - 255 bytes
395total = ~1.27 KB
39612.7 GB of new paste content per month
3971.27 KB per paste * 10 million pastes per month
398~450 GB of new paste content in 3 years
399360 million shortlinks in 3 years
400Assume most are new pastes instead of updates to existing ones
4014 paste writes per second on average
40240 read requests per second on average
403Handy conversion guide:
404
4052.5 million seconds per month
4061 request per second = 2.5 million requests per month
40740 requests per second = 100 million requests per month
408400 requests per second = 1 billion requests per month
409Step 2: Create a high level design
410Outline a high level design with all important components.
411
412Imgur
413
414Step 3: Design core components
415Dive into details for each core component.
416
417Use case: User enters a block of text and gets a randomly generated link
418We could use a relational database as a large hash table, mapping the generated url to a file server and path containing the paste file.
419
420Instead of managing a file server, we could use a managed Object Store such as Amazon S3 or a NoSQL document store.
421
422An alternative to a relational database acting as a large hash table, we could use a NoSQL key-value store. We should discuss the tradeoffs between choosing SQL or NoSQL. The following discussion uses the relational database approach.
423
424The Client sends a create paste request to the Web Server, running as a reverse proxy
425The Web Server forwards the request to the Write API server
426The Write API server does the following:
427Generates a unique url
428Checks if the url is unique by looking at the SQL Database for a duplicate
429If the url is not unique, it generates another url
430If we supported a custom url, we could use the user-supplied (also check for a duplicate)
431Saves to the SQL Database pastes table
432Saves the paste data to the Object Store
433Returns the url
434Clarify with your interviewer how much code you are expected to write.
435
436The pastes table could have the following structure:
437
438shortlink char(7) NOT NULL
439expiration_length_in_minutes int NOT NULL
440created_at datetime NOT NULL
441paste_path varchar(255) NOT NULL
442PRIMARY KEY(shortlink)
443We'll create an index on shortlink and created_at to speed up lookups (log-time instead of scanning the entire table) and to keep the data in memory. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.1
444
445To generate the unique url, we could:
446
447Take the MD5 hash of the user's ip_address + timestamp
448MD5 is a widely used hashing function that produces a 128-bit hash value
449MD5 is uniformly distributed
450Alternatively, we could also take the MD5 hash of randomly-generated data
451Base 62 encode the MD5 hash
452Base 62 encodes to [a-zA-Z0-9] which works well for urls, eliminating the need for escaping special characters
453There is only one hash result for the original input and Base 62 is deterministic (no randomness involved)
454Base 64 is another popular encoding but provides issues for urls because of the additional + and / characters
455The following Base 62 pseudocode runs in O(k) time where k is the number of digits = 7:
456def base_encode(num, base=62):
457    digits = []
458    while num > 0
459      remainder = modulo(num, base)
460      digits.push(remainder)
461      num = divide(num, base)
462    digits = digits.reverse
463Take the first 7 characters of the output, which results in 62^7 possible values and should be sufficient to handle our constraint of 360 million shortlinks in 3 years:
464url = base_encode(md5(ip_address+timestamp))[:URL_LENGTH]
465We'll use a public REST API:
466
467$ curl -X POST --data '{ "expiration_length_in_minutes": "60", \
468    "paste_contents": "Hello World!" }' https://pastebin.com/api/v1/paste
469Response:
470
471{
472    "shortlink": "foobar"
473}
474For internal communications, we could use Remote Procedure Calls.
475
476Use case: User enters a paste's url and views the contents
477The Client sends a get paste request to the Web Server
478The Web Server forwards the request to the Read API server
479The Read API server does the following:
480Checks the SQL Database for the generated url
481If the url is in the SQL Database, fetch the paste contents from the Object Store
482Else, return an error message for the user
483REST API:
484
485$ curl https://pastebin.com/api/v1/paste?shortlink=foobar
486Response:
487
488{
489    "paste_contents": "Hello World"
490    "created_at": "YYYY-MM-DD HH:MM:SS"
491    "expiration_length_in_minutes": "60"
492}
493Use case: Service tracks analytics of pages
494Since realtime analytics are not a requirement, we could simply MapReduce the Web Server logs to generate hit counts.
495
496Clarify with your interviewer how much code you are expected to write.
497
498class HitCounts(MRJob):
499
500    def extract_url(self, line):
501        """Extract the generated url from the log line."""
502        ...
503
504    def extract_year_month(self, line):
505        """Return the year and month portions of the timestamp."""
506        ...
507
508    def mapper(self, _, line):
509        """Parse each log line, extract and transform relevant lines.
510
511        Emit key value pairs of the form:
512
513        (2016-01, url0), 1
514        (2016-01, url0), 1
515        (2016-01, url1), 1
516        """
517        url = self.extract_url(line)
518        period = self.extract_year_month(line)
519        yield (period, url), 1
520
521    def reducer(self, key, values):
522        """Sum values for each key.
523
524        (2016-01, url0), 2
525        (2016-01, url1), 1
526        """
527        yield key, sum(values)
528Use case: Service deletes expired pastes
529To delete expired pastes, we could just scan the SQL Database for all entries whose expiration timestamp are older than the current timestamp. All expired entries would then be deleted (or marked as expired) from the table.
530
531Step 4: Scale the design
532Identify and address bottlenecks, given the constraints.
533
534Imgur
535
536Important: Do not simply jump right into the final design from the initial design!
537
538State you would do this iteratively: 1) Benchmark/Load Test, 2) Profile for bottlenecks 3) address bottlenecks while evaluating alternatives and trade-offs, and 4) repeat. See Design a system that scales to millions of users on AWS as a sample on how to iteratively scale the initial design.
539
540It's important to discuss what bottlenecks you might encounter with the initial design and how you might address each of them. For example, what issues are addressed by adding a Load Balancer with multiple Web Servers? CDN? Master-Slave Replicas? What are the alternatives and Trade-Offs for each?
541
542We'll introduce some components to complete the design and to address scalability issues. Internal load balancers are not shown to reduce clutter.
543
544To avoid repeating discussions, refer to the following system design topics for main talking points, tradeoffs, and alternatives:
545
546DNS
547CDN
548Load balancer
549Horizontal scaling
550Web server (reverse proxy)
551API server (application layer)
552Cache
553Relational database management system (RDBMS)
554SQL write master-slave failover
555Master-slave replication
556Consistency patterns
557Availability patterns
558The Analytics Database could use a data warehousing solution such as Amazon Redshift or Google BigQuery.
559
560An Object Store such as Amazon S3 can comfortably handle the constraint of 12.7 GB of new content per month.
561
562To address the 40 average read requests per second (higher at peak), traffic for popular content should be handled by the Memory Cache instead of the database. The Memory Cache is also useful for handling the unevenly distributed traffic and traffic spikes. The SQL Read Replicas should be able to handle the cache misses, as long as the replicas are not bogged down with replicating writes.
563
5644 average paste writes per second (with higher at peak) should be do-able for a single SQL Write Master-Slave. Otherwise, we'll need to employ additional SQL scaling patterns:
565
566Federation
567Sharding
568Denormalization
569SQL Tuning
570We should also consider moving some data to a NoSQL Database.
571
572Additional talking points
573Additional topics to dive into, depending on the problem scope and time remaining.
574
575NoSQL
576Key-value store
577Document store
578Wide column store
579Graph database
580SQL vs NoSQL
581Caching
582Where to cache
583Client caching
584CDN caching
585Web server caching
586Database caching
587Application caching
588What to cache
589Caching at the database query level
590Caching at the object level
591When to update the cache
592Cache-aside
593Write-through
594Write-behind (write-back)
595Refresh ahead
596Asynchronism and microservices
597Message queues
598Task queues
599Back pressure
600Microservices
601Communications
602Discuss tradeoffs:
603External communication with clients - HTTP APIs following REST
604Internal communications - RPC
605Service discovery
606Security
607Refer to the security section.
608
609Latency numbers
610See Latency numbers every programmer should know.
611
612Ongoing
613Continue benchmarking and monitoring your system to address bottlenecks as they come up
614Scaling is an iterative process
615© 2020 GitHub, Inc.
616Terms
617Privacy
618Security
619Status
620Help
621Contact GitHub
622Pricing
623API
624Training
625Blog
626About
627Skip to content
628Why GitHub? 
629Enterprise
630Explore 
631Marketplace
632Pricing 
633Search
634
635Sign in
636Sign up
637donnemartin
638/
639system-design-primer
6404.3k82.4k13.8k
641 Code Issues 78 Pull requests 58 Actions Projects 0 Security Insights
642system-design-primer/solutions/system_design/pastebin/
643@kevinxuv
644@donnemartin
645kevinxuv and donnemartin zh-Hans: Translate Pastebin solution (#273)
646Latest commit
64733431e6
648on Jun 16, 2019
649Type	Name	Latest commit message	Commit time
650..		
651README-zh-Hans.md	zh-Hans: Translate Pastebin solution (#273)	8 months ago
652README.md	Enable syntax highlighting in all python code snippets (#268)	9 months ago
653__init__.py	Add Pastebin solution	3 years ago
654pastebin.graffle	Add pastebin OmniGraffle diagrams	3 years ago
655pastebin.png	Add Pastebin solution	3 years ago
656pastebin.py	Fix coding errors (#149)	2 years ago
657pastebin_basic.graffle	Add pastebin OmniGraffle diagrams	3 years ago
658pastebin_basic.png	Add Pastebin solution	3 years ago
659 README.md
660Design Pastebin.com (or Bit.ly)
661Note: This document links directly to relevant areas found in the system design topics to avoid duplication. Refer to the linked content for general talking points, tradeoffs, and alternatives.
662
663Design Bit.ly - is a similar question, except pastebin requires storing the paste contents instead of the original unshortened url.
664
665Step 1: Outline use cases and constraints
666Gather requirements and scope the problem. Ask questions to clarify use cases and constraints. Discuss assumptions.
667
668Without an interviewer to address clarifying questions, we'll define some use cases and constraints.
669
670Use cases
671We'll scope the problem to handle only the following use cases
672User enters a block of text and gets a randomly generated link
673Expiration
674Default setting does not expire
675Can optionally set a timed expiration
676User enters a paste's url and views the contents
677User is anonymous
678Service tracks analytics of pages
679Monthly visit stats
680Service deletes expired pastes
681Service has high availability
682Out of scope
683User registers for an account
684User verifies email
685User logs into a registered account
686User edits the document
687User can set visibility
688User can set the shortlink
689Constraints and assumptions
690State assumptions
691Traffic is not evenly distributed
692Following a short link should be fast
693Pastes are text only
694Page view analytics do not need to be realtime
69510 million users
69610 million paste writes per month
697100 million paste reads per month
69810:1 read to write ratio
699Calculate usage
700Clarify with your interviewer if you should run back-of-the-envelope usage calculations.
701
702Size per paste
7031 KB content per paste
704shortlink - 7 bytes
705expiration_length_in_minutes - 4 bytes
706created_at - 5 bytes
707paste_path - 255 bytes
708total = ~1.27 KB
70912.7 GB of new paste content per month
7101.27 KB per paste * 10 million pastes per month
711~450 GB of new paste content in 3 years
712360 million shortlinks in 3 years
713Assume most are new pastes instead of updates to existing ones
7144 paste writes per second on average
71540 read requests per second on average
716Handy conversion guide:
717
7182.5 million seconds per month
7191 request per second = 2.5 million requests per month
72040 requests per second = 100 million requests per month
721400 requests per second = 1 billion requests per month
722Step 2: Create a high level design
723Outline a high level design with all important components.
724
725Imgur
726
727Step 3: Design core components
728Dive into details for each core component.
729
730Use case: User enters a block of text and gets a randomly generated link
731We could use a relational database as a large hash table, mapping the generated url to a file server and path containing the paste file.
732
733Instead of managing a file server, we could use a managed Object Store such as Amazon S3 or a NoSQL document store.
734
735An alternative to a relational database acting as a large hash table, we could use a NoSQL key-value store. We should discuss the tradeoffs between choosing SQL or NoSQL. The following discussion uses the relational database approach.
736
737The Client sends a create paste request to the Web Server, running as a reverse proxy
738The Web Server forwards the request to the Write API server
739The Write API server does the following:
740Generates a unique url
741Checks if the url is unique by looking at the SQL Database for a duplicate
742If the url is not unique, it generates another url
743If we supported a custom url, we could use the user-supplied (also check for a duplicate)
744Saves to the SQL Database pastes table
745Saves the paste data to the Object Store
746Returns the url
747Clarify with your interviewer how much code you are expected to write.
748
749The pastes table could have the following structure:
750
751shortlink char(7) NOT NULL
752expiration_length_in_minutes int NOT NULL
753created_at datetime NOT NULL
754paste_path varchar(255) NOT NULL
755PRIMARY KEY(shortlink)
756We'll create an index on shortlink and created_at to speed up lookups (log-time instead of scanning the entire table) and to keep the data in memory. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.1
757
758To generate the unique url, we could:
759
760Take the MD5 hash of the user's ip_address + timestamp
761MD5 is a widely used hashing function that produces a 128-bit hash value
762MD5 is uniformly distributed
763Alternatively, we could also take the MD5 hash of randomly-generated data
764Base 62 encode the MD5 hash
765Base 62 encodes to [a-zA-Z0-9] which works well for urls, eliminating the need for escaping special characters
766There is only one hash result for the original input and Base 62 is deterministic (no randomness involved)
767Base 64 is another popular encoding but provides issues for urls because of the additional + and / characters
768The following Base 62 pseudocode runs in O(k) time where k is the number of digits = 7:
769def base_encode(num, base=62):
770    digits = []
771    while num > 0
772      remainder = modulo(num, base)
773      digits.push(remainder)
774      num = divide(num, base)
775    digits = digits.reverse
776Take the first 7 characters of the output, which results in 62^7 possible values and should be sufficient to handle our constraint of 360 million shortlinks in 3 years:
777url = base_encode(md5(ip_address+timestamp))[:URL_LENGTH]
778We'll use a public REST API:
779
780$ curl -X POST --data '{ "expiration_length_in_minutes": "60", \
781    "paste_contents": "Hello World!" }' https://pastebin.com/api/v1/paste
782Response:
783
784{
785    "shortlink": "foobar"
786}
787For internal communications, we could use Remote Procedure Calls.
788
789Use case: User enters a paste's url and views the contents
790The Client sends a get paste request to the Web Server
791The Web Server forwards the request to the Read API server
792The Read API server does the following:
793Checks the SQL Database for the generated url
794If the url is in the SQL Database, fetch the paste contents from the Object Store
795Else, return an error message for the user
796REST API:
797
798$ curl https://pastebin.com/api/v1/paste?shortlink=foobar
799Response:
800
801{
802    "paste_contents": "Hello World"
803    "created_at": "YYYY-MM-DD HH:MM:SS"
804    "expiration_length_in_minutes": "60"
805}
806Use case: Service tracks analytics of pages
807Since realtime analytics are not a requirement, we could simply MapReduce the Web Server logs to generate hit counts.
808
809Clarify with your interviewer how much code you are expected to write.
810
811class HitCounts(MRJob):
812
813    def extract_url(self, line):
814        """Extract the generated url from the log line."""
815        ...
816
817    def extract_year_month(self, line):
818        """Return the year and month portions of the timestamp."""
819        ...
820
821    def mapper(self, _, line):
822        """Parse each log line, extract and transform relevant lines.
823
824        Emit key value pairs of the form:
825
826        (2016-01, url0), 1
827        (2016-01, url0), 1
828        (2016-01, url1), 1
829        """
830        url = self.extract_url(line)
831        period = self.extract_year_month(line)
832        yield (period, url), 1
833
834    def reducer(self, key, values):
835        """Sum values for each key.
836
837        (2016-01, url0), 2
838        (2016-01, url1), 1
839        """
840        yield key, sum(values)
841Use case: Service deletes expired pastes
842To delete expired pastes, we could just scan the SQL Database for all entries whose expiration timestamp are older than the current timestamp. All expired entries would then be deleted (or marked as expired) from the table.
843
844Step 4: Scale the design
845Identify and address bottlenecks, given the constraints.
846
847Imgur
848
849Important: Do not simply jump right into the final design from the initial design!
850
851State you would do this iteratively: 1) Benchmark/Load Test, 2) Profile for bottlenecks 3) address bottlenecks while evaluating alternatives and trade-offs, and 4) repeat. See Design a system that scales to millions of users on AWS as a sample on how to iteratively scale the initial design.
852
853It's important to discuss what bottlenecks you might encounter with the initial design and how you might address each of them. For example, what issues are addressed by adding a Load Balancer with multiple Web Servers? CDN? Master-Slave Replicas? What are the alternatives and Trade-Offs for each?
854
855We'll introduce some components to complete the design and to address scalability issues. Internal load balancers are not shown to reduce clutter.
856
857To avoid repeating discussions, refer to the following system design topics for main talking points, tradeoffs, and alternatives:
858
859DNS
860CDN
861Load balancer
862Horizontal scaling
863Web server (reverse proxy)
864API server (application layer)
865Cache
866Relational database management system (RDBMS)
867SQL write master-slave failover
868Master-slave replication
869Consistency patterns
870Availability patterns
871The Analytics Database could use a data warehousing solution such as Amazon Redshift or Google BigQuery.
872
873An Object Store such as Amazon S3 can comfortably handle the constraint of 12.7 GB of new content per month.
874
875To address the 40 average read requests per second (higher at peak), traffic for popular content should be handled by the Memory Cache instead of the database. The Memory Cache is also useful for handling the unevenly distributed traffic and traffic spikes. The SQL Read Replicas should be able to handle the cache misses, as long as the replicas are not bogged down with replicating writes.
876
8774 average paste writes per second (with higher at peak) should be do-able for a single SQL Write Master-Slave. Otherwise, we'll need to employ additional SQL scaling patterns:
878
879Federation
880Sharding
881Denormalization
882SQL Tuning
883We should also consider moving some data to a NoSQL Database.
884
885Additional talking points
886Additional topics to dive into, depending on the problem scope and time remaining.
887
888NoSQL
889Key-value store
890Document store
891Wide column store
892Graph database
893SQL vs NoSQL
894Caching
895Where to cache
896Client caching
897CDN caching
898Web server caching
899Database caching
900Application caching
901What to cache
902Caching at the database query level
903Caching at the object level
904When to update the cache
905Cache-aside
906Write-through
907Write-behind (write-back)
908Refresh ahead
909Asynchronism and microservices
910Message queues
911Task queues
912Back pressure
913Microservices
914Communications
915Discuss tradeoffs:
916External communication with clients - HTTP APIs following REST
917Internal communications - RPC
918Service discovery
919Security
920Refer to the security section.
921
922Latency numbers
923See Latency numbers every programmer should know.
924
925Ongoing
926Continue benchmarking and monitoring your system to address bottlenecks as they come up
927Scaling is an iterative process
928© 2020 GitHub, Inc.
929Terms
930Privacy
931Security
932Status
933Help
934Contact GitHub
935Pricing
936API
937Training
938Blog
939About