Design Pastebin.com (or Bit.ly)

Note: This document links directly to relevant areas found in the system design topics to avoid duplication. Refer to the linked content for general talking points, tradeoffs, and alternatives.

Design Bit.ly is a similar question, except Pastebin requires storing the paste contents instead of the original unshortened url.

Step 1: Outline use cases and constraints

Gather requirements and scope the problem. Ask questions to clarify use cases and constraints. Discuss assumptions.

Without an interviewer to address clarifying questions, we'll define some use cases and constraints.

Use cases

We'll scope the problem to handle only the following use cases:

User enters a block of text and gets a randomly generated link
Expiration
Default setting does not expire
Can optionally set a timed expiration
User enters a paste's url and views the contents
User is anonymous
Service tracks analytics of pages
Monthly visit stats
Service deletes expired pastes
Service has high availability

Out of scope

User registers for an account
User verifies email
User logs into a registered account
User edits the document
User can set visibility
User can set the shortlink

Constraints and assumptions

State assumptions

Traffic is not evenly distributed
Following a short link should be fast
Pastes are text only
Page view analytics do not need to be realtime
10 million users
10 million paste writes per month
100 million paste reads per month
10:1 read to write ratio

Calculate usage

Clarify with your interviewer if you should run back-of-the-envelope usage calculations.

Size per paste:

1 KB content per paste
shortlink - 7 bytes
expiration_length_in_minutes - 4 bytes
created_at - 5 bytes
paste_path - 255 bytes
total = ~1.27 KB

12.7 GB of new paste content per month
1.27 KB per paste * 10 million pastes per month
~450 GB of new paste content in 3 years
360 million shortlinks in 3 years
Assume most are new pastes instead of updates to existing ones
4 paste writes per second on average
40 read requests per second on average

Handy conversion guide:

2.5 million seconds per month
1 request per second = 2.5 million requests per month
40 requests per second = 100 million requests per month
400 requests per second = 1 billion requests per month
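
A quick sanity check of these estimates in Python (the values simply mirror the assumptions above):

KB = 1000
GB = 1000 ** 3
SECONDS_PER_MONTH = 2.5 * 10 ** 6

size_per_paste_bytes = 1 * KB + 7 + 4 + 5 + 255  # ~1.27 KB
writes_per_month = 10 * 10 ** 6
reads_per_month = 100 * 10 ** 6

new_content_per_month_gb = size_per_paste_bytes * writes_per_month / GB  # ~12.7 GB
new_content_3_years_gb = new_content_per_month_gb * 36                   # ~457 GB
shortlinks_3_years = writes_per_month * 36                               # 360 million
writes_per_second = writes_per_month / SECONDS_PER_MONTH                 # ~4
reads_per_second = reads_per_month / SECONDS_PER_MONTH                   # ~40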

Step 2: Create a high level design

Outline a high level design with all important components.

[Imgur: high level design diagram]

Step 3: Design core components

Dive into details for each core component.

Use case: User enters a block of text and gets a randomly generated link

We could use a relational database as a large hash table, mapping the generated url to a file server and path containing the paste file.

Instead of managing a file server, we could use a managed Object Store such as Amazon S3 or a NoSQL document store.

As an alternative to a relational database acting as a large hash table, we could use a NoSQL key-value store. We should discuss the tradeoffs between choosing SQL or NoSQL. The following discussion uses the relational database approach.

The Client sends a create paste request to the Web Server, running as a reverse proxy
The Web Server forwards the request to the Write API server
The Write API server does the following:
Generates a unique url
Checks if the url is unique by looking at the SQL Database for a duplicate
If the url is not unique, it generates another url
If we supported a custom url, we could use the user-supplied url (also checking for a duplicate)
Saves to the SQL Database pastes table
Saves the paste data to the Object Store
Returns the url

Clarify with your interviewer how much code you are expected to write.
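
A minimal sketch of the write path under these assumptions (generate_unique_url and the object_store interface are hypothetical helpers, not the repo's actual code):

import datetime

def create_paste(db, object_store, paste_contents, expiration_length_in_minutes=0):
    """Generate a unique shortlink, save metadata to SQL, save contents to the Object Store."""
    shortlink = generate_unique_url(db)           # hypothetical helper: retry until no duplicate
    paste_path = 'pastes/{}'.format(shortlink)    # where the contents live in the object store
    object_store.put(paste_path, paste_contents)  # assumed object store interface (e.g. an S3 wrapper)
    db.execute(
        'INSERT INTO pastes (shortlink, expiration_length_in_minutes, created_at, paste_path) '
        'VALUES (?, ?, ?, ?)',
        (shortlink, expiration_length_in_minutes,
         datetime.datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S'), paste_path))
    return shortlink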

The pastes table could have the following structure:

shortlink char(7) NOT NULL
expiration_length_in_minutes int NOT NULL
created_at datetime NOT NULL
paste_path varchar(255) NOT NULL
PRIMARY KEY(shortlink)

We'll create an index on shortlink and created_at to speed up lookups (log-time instead of scanning the entire table) and to keep the data in memory. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x as long and reading from disk takes 80x as long.[1]
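
A minimal sketch of the schema and index, using sqlite3 for illustration (the document doesn't prescribe a particular RDBMS; column names mirror the table above):

import sqlite3

conn = sqlite3.connect('pastebin.db')  # illustrative database file
conn.executescript("""
    CREATE TABLE IF NOT EXISTS pastes (
        shortlink                    CHAR(7)      NOT NULL,
        expiration_length_in_minutes INT          NOT NULL,
        created_at                   DATETIME     NOT NULL,
        paste_path                   VARCHAR(255) NOT NULL,
        PRIMARY KEY (shortlink)
    );
    CREATE INDEX IF NOT EXISTS idx_pastes_created_at ON pastes (created_at);
""")
conn.close()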

To generate the unique url, we could:

Take the MD5 hash of the user's ip_address + timestamp
MD5 is a widely used hashing function that produces a 128-bit hash value
MD5 is uniformly distributed
Alternatively, we could also take the MD5 hash of randomly-generated data
Base 62 encode the MD5 hash
Base 62 encodes to [a-zA-Z0-9], which works well for urls, eliminating the need for escaping special characters
There is only one hash result for the original input and Base 62 is deterministic (no randomness involved)
Base 64 is another popular encoding but poses issues for urls because of the additional + and / characters
The following Base 62 function runs in O(k) time, where k is the number of digits = 7:
def base_encode(num, base=62):
    """Convert a number to a Base 62 string."""
    alphabet = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    digits = []
    while num > 0:
        remainder = num % base
        digits.append(alphabet[remainder])
        num //= base
    return ''.join(reversed(digits))

Take the first 7 characters of the output, which results in 62^7 possible values and should be sufficient to handle our constraint of 360 million shortlinks in 3 years:

url = base_encode(md5(ip_address+timestamp))[:URL_LENGTH]
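
Putting it together, a runnable sketch using hashlib (URL_LENGTH and the ip_address + timestamp input follow the example above; treat this as illustrative, not the repo's implementation):

import hashlib
import time

URL_LENGTH = 7

def generate_url(ip_address):
    """MD5 the ip_address + timestamp, Base 62 encode, and keep the first 7 characters."""
    seed = '{}{}'.format(ip_address, time.time())
    digest = hashlib.md5(seed.encode('utf-8')).hexdigest()
    return base_encode(int(digest, 16))[:URL_LENGTH]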

We'll use a public REST API:

$ curl -X POST --data '{ "expiration_length_in_minutes": "60", \
    "paste_contents": "Hello World!" }' https://pastebin.com/api/v1/paste

Response:

{
    "shortlink": "foobar"
}
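
The same call from Python using the requests library, for example (endpoint and fields as shown above):

import requests

response = requests.post(
    'https://pastebin.com/api/v1/paste',
    json={'expiration_length_in_minutes': '60', 'paste_contents': 'Hello World!'})
print(response.json()['shortlink'])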

For internal communications, we could use Remote Procedure Calls.

Use case: User enters a paste's url and views the contents

The Client sends a get paste request to the Web Server
The Web Server forwards the request to the Read API server
The Read API server does the following:
Checks the SQL Database for the generated url
If the url is in the SQL Database, fetch the paste contents from the Object Store
Else, return an error message for the user

REST API:

$ curl https://pastebin.com/api/v1/paste?shortlink=foobar

Response:

{
    "paste_contents": "Hello World",
    "created_at": "YYYY-MM-DD HH:MM:SS",
    "expiration_length_in_minutes": "60"
}
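
A corresponding sketch of the read path (again illustrative; db and object_store are the same assumed helpers as in the write path sketch):

def get_paste(db, object_store, shortlink):
    """Look up the shortlink in SQL, then fetch the paste contents from the Object Store."""
    row = db.execute(
        'SELECT paste_path, created_at, expiration_length_in_minutes '
        'FROM pastes WHERE shortlink = ?', (shortlink,)).fetchone()
    if row is None:
        return {'error': 'Paste not found'}
    paste_path, created_at, expiration_length_in_minutes = row
    return {
        'paste_contents': object_store.get(paste_path),  # assumed object store interface
        'created_at': created_at,
        'expiration_length_in_minutes': expiration_length_in_minutes,
    }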

Use case: Service tracks analytics of pages

Since realtime analytics are not a requirement, we could simply MapReduce the Web Server logs to generate hit counts.

Clarify with your interviewer how much code you are expected to write.

from mrjob.job import MRJob


class HitCounts(MRJob):

    def extract_url(self, line):
        """Extract the generated url from the log line."""
        ...

    def extract_year_month(self, line):
        """Return the year and month portions of the timestamp."""
        ...

    def mapper(self, _, line):
        """Parse each log line, extract and transform relevant lines.

        Emit key value pairs of the form:

        (2016-01, url0), 1
        (2016-01, url0), 1
        (2016-01, url1), 1
        """
        url = self.extract_url(line)
        period = self.extract_year_month(line)
        yield (period, url), 1

    def reducer(self, key, values):
        """Sum values for each key.

        (2016-01, url0), 2
        (2016-01, url1), 1
        """
        yield key, sum(values)


if __name__ == '__main__':
    HitCounts.run()
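
With mrjob, the job could then be run locally over the raw Web Server logs (the script and log file names here are illustrative):

$ python hit_counts.py access.log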

Use case: Service deletes expired pastes

To delete expired pastes, we could simply scan the SQL Database for all entries whose expiration timestamp is older than the current timestamp. All expired entries would then be deleted (or marked as expired) from the table.
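
A minimal sketch of this cleanup job against the sqlite3 schema above (assuming, for illustration, that an expiration_length_in_minutes of 0 means the paste never expires):

import sqlite3

def delete_expired_pastes(db_path='pastebin.db'):
    """Delete pastes whose created_at plus expiration window has passed."""
    with sqlite3.connect(db_path) as conn:
        conn.execute("""
            DELETE FROM pastes
            WHERE expiration_length_in_minutes > 0
              AND datetime(created_at, '+' || expiration_length_in_minutes || ' minutes')
                  < datetime('now')
        """)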

Step 4: Scale the design

Identify and address bottlenecks, given the constraints.

[Imgur: scaled design diagram]

Important: Do not simply jump right into the final design from the initial design!

State that you would do this iteratively: 1) Benchmark/Load Test, 2) Profile for bottlenecks, 3) Address bottlenecks while evaluating alternatives and trade-offs, and 4) Repeat. See Design a system that scales to millions of users on AWS as a sample of how to iteratively scale the initial design.

It's important to discuss what bottlenecks you might encounter with the initial design and how you might address each of them. For example, what issues are addressed by adding a Load Balancer with multiple Web Servers? CDN? Master-Slave Replicas? What are the alternatives and Trade-Offs for each?

We'll introduce some components to complete the design and to address scalability issues. Internal load balancers are not shown to reduce clutter.

To avoid repeating discussions, refer to the following system design topics for main talking points, tradeoffs, and alternatives:

DNS
CDN
Load balancer
Horizontal scaling
Web server (reverse proxy)
API server (application layer)
Cache
Relational database management system (RDBMS)
SQL write master-slave failover
Master-slave replication
Consistency patterns
Availability patterns

The Analytics Database could use a data warehousing solution such as Amazon Redshift or Google BigQuery.

An Object Store such as Amazon S3 can comfortably handle the constraint of 12.7 GB of new content per month.

To address the 40 average read requests per second (higher at peak), traffic for popular content should be handled by the Memory Cache instead of the database. The Memory Cache is also useful for handling the unevenly distributed traffic and traffic spikes. The SQL Read Replicas should be able to handle the cache misses, as long as the replicas are not bogged down with replicating writes.
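
A cache-aside sketch for the read path (the cache client interface and TTL are assumptions; the Memory Cache could be Redis or Memcached):

def get_paste_cached(cache, db, object_store, shortlink):
    """Serve popular pastes from the Memory Cache, falling back to SQL + Object Store."""
    paste = cache.get(shortlink)
    if paste is not None:
        return paste
    paste = get_paste(db, object_store, shortlink)  # read path sketch from Step 3
    cache.set(shortlink, paste, ttl=60 * 10)        # assumed cache interface and a 10-minute TTL
    return paste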

4 average paste writes per second (with higher at peak) should be doable for a single SQL Write Master-Slave. Otherwise, we'll need to employ additional SQL scaling patterns:

Federation
Sharding
Denormalization
SQL Tuning

We should also consider moving some data to a NoSQL Database.

Additional talking points

Additional topics to dive into, depending on the problem scope and time remaining.

NoSQL

Key-value store
Document store
Wide column store
Graph database
SQL vs NoSQL

Caching

Where to cache
Client caching
CDN caching
Web server caching
Database caching
Application caching
What to cache
Caching at the database query level
Caching at the object level
When to update the cache
Cache-aside
Write-through
Write-behind (write-back)
Refresh ahead

Asynchronism and microservices

Message queues
Task queues
Back pressure
Microservices

Communications

Discuss tradeoffs:
External communication with clients - HTTP APIs following REST
Internal communications - RPC
Service discovery

Security

Refer to the security section.

Latency numbers

See Latency numbers every programmer should know.

Ongoing

Continue benchmarking and monitoring your system to address bottlenecks as they come up
Scaling is an iterative process