· 6 years ago · Nov 06, 2019, 06:52 AM
1Skip to content
2Why GitHub?
3Enterprise
4Explore
5Marketplace
6Pricing
7Search
8
9Sign in
10Sign up
114.1k75.9k12.4kdonnemartin/system-design-primer
12 Code Issues 67 Pull requests 51 Projects 0 Security Insights
13Join GitHub today
14GitHub is home to over 40 million developers working together to host and review code, manage projects, and build software together.
15
16system-design-primer/solutions/system_design/pastebin/README.md
17@manaskarekar manaskarekar Enable syntax highlighting in all python code snippets (#268)
18116634f on May 7
19@manaskarekar@donnemartin@Tarrasch@fabriziocucci@balki-server@baldassarreFe
20333 lines (240 sloc) 15.3 KB
21
22Design Pastebin.com (or Bit.ly)
23Note: This document links directly to relevant areas found in the system design topics to avoid duplication. Refer to the linked content for general talking points, tradeoffs, and alternatives.
24
25Design Bit.ly - is a similar question, except pastebin requires storing the paste contents instead of the original unshortened url.
26
27Step 1: Outline use cases and constraints
28Gather requirements and scope the problem. Ask questions to clarify use cases and constraints. Discuss assumptions.
29
30Without an interviewer to address clarifying questions, we'll define some use cases and constraints.
31
32Use cases
33We'll scope the problem to handle only the following use cases
34User enters a block of text and gets a randomly generated link
35Expiration
36Default setting does not expire
37Can optionally set a timed expiration
38User enters a paste's url and views the contents
39User is anonymous
40Service tracks analytics of pages
41Monthly visit stats
42Service deletes expired pastes
43Service has high availability
44Out of scope
45User registers for an account
46User verifies email
47User logs into a registered account
48User edits the document
49User can set visibility
50User can set the shortlink
51Constraints and assumptions
52State assumptions
53Traffic is not evenly distributed
54Following a short link should be fast
55Pastes are text only
56Page view analytics do not need to be realtime
5710 million users
5810 million paste writes per month
59100 million paste reads per month
6010:1 read to write ratio
61Calculate usage
62Clarify with your interviewer if you should run back-of-the-envelope usage calculations.
63
64Size per paste
651 KB content per paste
66shortlink - 7 bytes
67expiration_length_in_minutes - 4 bytes
68created_at - 5 bytes
69paste_path - 255 bytes
70total = ~1.27 KB
7112.7 GB of new paste content per month
721.27 KB per paste * 10 million pastes per month
73~450 GB of new paste content in 3 years
74360 million shortlinks in 3 years
75Assume most are new pastes instead of updates to existing ones
764 paste writes per second on average
7740 read requests per second on average
78Handy conversion guide:
79
802.5 million seconds per month
811 request per second = 2.5 million requests per month
8240 requests per second = 100 million requests per month
83400 requests per second = 1 billion requests per month
84Step 2: Create a high level design
85Outline a high level design with all important components.
86
87Imgur
88
89Step 3: Design core components
90Dive into details for each core component.
91
92Use case: User enters a block of text and gets a randomly generated link
93We could use a relational database as a large hash table, mapping the generated url to a file server and path containing the paste file.
94
95Instead of managing a file server, we could use a managed Object Store such as Amazon S3 or a NoSQL document store.
96
97An alternative to a relational database acting as a large hash table, we could use a NoSQL key-value store. We should discuss the tradeoffs between choosing SQL or NoSQL. The following discussion uses the relational database approach.
98
99The Client sends a create paste request to the Web Server, running as a reverse proxy
100The Web Server forwards the request to the Write API server
101The Write API server does the following:
102Generates a unique url
103Checks if the url is unique by looking at the SQL Database for a duplicate
104If the url is not unique, it generates another url
105If we supported a custom url, we could use the user-supplied (also check for a duplicate)
106Saves to the SQL Database pastes table
107Saves the paste data to the Object Store
108Returns the url
109Clarify with your interviewer how much code you are expected to write.
110
111The pastes table could have the following structure:
112
113shortlink char(7) NOT NULL
114expiration_length_in_minutes int NOT NULL
115created_at datetime NOT NULL
116paste_path varchar(255) NOT NULL
117PRIMARY KEY(shortlink)
118We'll create an index on shortlink and created_at to speed up lookups (log-time instead of scanning the entire table) and to keep the data in memory. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.1
119
120To generate the unique url, we could:
121
122Take the MD5 hash of the user's ip_address + timestamp
123MD5 is a widely used hashing function that produces a 128-bit hash value
124MD5 is uniformly distributed
125Alternatively, we could also take the MD5 hash of randomly-generated data
126Base 62 encode the MD5 hash
127Base 62 encodes to [a-zA-Z0-9] which works well for urls, eliminating the need for escaping special characters
128There is only one hash result for the original input and Base 62 is deterministic (no randomness involved)
129Base 64 is another popular encoding but provides issues for urls because of the additional + and / characters
130The following Base 62 pseudocode runs in O(k) time where k is the number of digits = 7:
131def base_encode(num, base=62):
132 digits = []
133 while num > 0
134 remainder = modulo(num, base)
135 digits.push(remainder)
136 num = divide(num, base)
137 digits = digits.reverse
138Take the first 7 characters of the output, which results in 62^7 possible values and should be sufficient to handle our constraint of 360 million shortlinks in 3 years:
139url = base_encode(md5(ip_address+timestamp))[:URL_LENGTH]
140We'll use a public REST API:
141
142$ curl -X POST --data '{ "expiration_length_in_minutes": "60", \
143 "paste_contents": "Hello World!" }' https://pastebin.com/api/v1/paste
144Response:
145
146{
147 "shortlink": "foobar"
148}
149For internal communications, we could use Remote Procedure Calls.
150
151Use case: User enters a paste's url and views the contents
152The Client sends a get paste request to the Web Server
153The Web Server forwards the request to the Read API server
154The Read API server does the following:
155Checks the SQL Database for the generated url
156If the url is in the SQL Database, fetch the paste contents from the Object Store
157Else, return an error message for the user
158REST API:
159
160$ curl https://pastebin.com/api/v1/paste?shortlink=foobar
161Response:
162
163{
164 "paste_contents": "Hello World"
165 "created_at": "YYYY-MM-DD HH:MM:SS"
166 "expiration_length_in_minutes": "60"
167}
168Use case: Service tracks analytics of pages
169Since realtime analytics are not a requirement, we could simply MapReduce the Web Server logs to generate hit counts.
170
171Clarify with your interviewer how much code you are expected to write.
172
173class HitCounts(MRJob):
174
175 def extract_url(self, line):
176 """Extract the generated url from the log line."""
177 ...
178
179 def extract_year_month(self, line):
180 """Return the year and month portions of the timestamp."""
181 ...
182
183 def mapper(self, _, line):
184 """Parse each log line, extract and transform relevant lines.
185
186 Emit key value pairs of the form:
187
188 (2016-01, url0), 1
189 (2016-01, url0), 1
190 (2016-01, url1), 1
191 """
192 url = self.extract_url(line)
193 period = self.extract_year_month(line)
194 yield (period, url), 1
195
196 def reducer(self, key, values):
197 """Sum values for each key.
198
199 (2016-01, url0), 2
200 (2016-01, url1), 1
201 """
202 yield key, sum(values)
203Use case: Service deletes expired pastes
204To delete expired pastes, we could just scan the SQL Database for all entries whose expiration timestamp are older than the current timestamp. All expired entries would then be deleted (or marked as expired) from the table.
205
206Step 4: Scale the design
207Identify and address bottlenecks, given the constraints.
208
209Imgur
210
211Important: Do not simply jump right into the final design from the initial design!
212
213State you would do this iteratively: 1) Benchmark/Load Test, 2) Profile for bottlenecks 3) address bottlenecks while evaluating alternatives and trade-offs, and 4) repeat. See Design a system that scales to millions of users on AWS as a sample on how to iteratively scale the initial design.
214
215It's important to discuss what bottlenecks you might encounter with the initial design and how you might address each of them. For example, what issues are addressed by adding a Load Balancer with multiple Web Servers? CDN? Master-Slave Replicas? What are the alternatives and Trade-Offs for each?
216
217We'll introduce some components to complete the design and to address scalability issues. Internal load balancers are not shown to reduce clutter.
218
219To avoid repeating discussions, refer to the following system design topics for main talking points, tradeoffs, and alternatives:
220
221DNS
222CDN
223Load balancer
224Horizontal scaling
225Web server (reverse proxy)
226API server (application layer)
227Cache
228Relational database management system (RDBMS)
229SQL write master-slave failover
230Master-slave replication
231Consistency patterns
232Availability patterns
233The Analytics Database could use a data warehousing solution such as Amazon Redshift or Google BigQuery.
234
235An Object Store such as Amazon S3 can comfortably handle the constraint of 12.7 GB of new content per month.
236
237To address the 40 average read requests per second (higher at peak), traffic for popular content should be handled by the Memory Cache instead of the database. The Memory Cache is also useful for handling the unevenly distributed traffic and traffic spikes. The SQL Read Replicas should be able to handle the cache misses, as long as the replicas are not bogged down with replicating writes.
238
2394 average paste writes per second (with higher at peak) should be do-able for a single SQL Write Master-Slave. Otherwise, we'll need to employ additional SQL scaling patterns:
240
241Federation
242Sharding
243Denormalization
244SQL Tuning
245We should also consider moving some data to a NoSQL Database.
246
247Additional talking points
248Additional topics to dive into, depending on the problem scope and time remaining.
249
250NoSQL
251Key-value store
252Document store
253Wide column store
254Graph database
255SQL vs NoSQL
256Caching
257Where to cache
258Client caching
259CDN caching
260Web server caching
261Database caching
262Application caching
263What to cache
264Caching at the database query level
265Caching at the object level
266When to update the cache
267Cache-aside
268Write-through
269Write-behind (write-back)
270Refresh ahead
271Asynchronism and microservices
272Message queues
273Task queues
274Back pressure
275Microservices
276Communications
277Discuss tradeoffs:
278External communication with clients - HTTP APIs following REST
279Internal communications - RPC
280Service discovery
281Security
282Refer to the security section.
283
284Latency numbers
285See Latency numbers every programmer should know.
286
287Ongoing
288Continue benchmarking and monitoring your system to address bottlenecks as they come up
289Scaling is an iterative process
290© 2019 GitHub, Inc.
291Terms
292Privacy
293Security
294Status
295Help
296Contact GitHub
297Pricing
298API
299Training
300Blog
301About