Design Pastebin
varyani_dinesh
Last Edit: October 20, 2018 11:35 AM

I was asked this in an Amazon interview for an SDE-2 position.

What is Pastebin?
The Pastebin web service enables users to store plain text over the network and generates unique URLs to access the uploaded data. It is also used to share data over the network quickly: users just need to pass the URL to let other users see it.

Functional Requirements:

Users should be able to upload plain text, called "pastes".
Upon uploading a paste, the user should be given a unique URL to access it.
Users should only be able to upload text.
URLs should expire after a default period of time if the user does not provide an expiration time.
Non-Functional Requirements:

The system should be highly reliable, i.e. no data loss.
The system should be highly available, i.e. users should be able to access their pastes most of the time.
Fetching a user's pastes should have minimal latency.
Database Design:

User table
a. name
b. email
c. created_date
d. last_login
e. userId - PK

Paste table
a. paste_name
b. url
c. content
d. expiration_time
e. created_date
f. userId - FK
g. pasteId - PK

System APIs

These can be implemented as RESTful web services:

createPaste
updatePaste
getPaste
deletePaste
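
A minimal sketch of the createPaste/getPaste pair using Go's net/http. The routes and key format are assumptions, and an in-memory map stands in for the real metadata database and block storage:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

// An in-memory map stands in for the real DB + block storage.
var (
	mu     sync.Mutex
	pastes = map[string]string{}
	nextID = 0
)

// storePaste saves text under a fresh key and returns the key.
func storePaste(text string) string {
	mu.Lock()
	defer mu.Unlock()
	nextID++
	key := fmt.Sprintf("p%d", nextID)
	pastes[key] = text
	return key
}

// loadPaste returns the text stored under key, if any.
func loadPaste(key string) (string, bool) {
	mu.Lock()
	defer mu.Unlock()
	text, ok := pastes[key]
	return text, ok
}

func main() {
	// createPaste: POST the text as the request body; the response is
	// the unique URL path of the new paste.
	http.HandleFunc("/paste", func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		fmt.Fprintf(w, "/paste/%s", storePaste(string(body)))
	})
	// getPaste: GET /paste/{key} returns the stored text or 404.
	http.HandleFunc("/paste/", func(w http.ResponseWriter, r *http.Request) {
		text, ok := loadPaste(r.URL.Path[len("/paste/"):])
		if !ok {
			http.NotFound(w, r)
			return
		}
		io.WriteString(w, text)
	})
	http.ListenAndServe(":8080", nil)
}
```

updatePaste and deletePaste would be analogous handlers on PUT and DELETE.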
High Level Design

At a high level, the system needs an application server that can serve read and write requests. The application server stores paste data on block storage, while all metadata related to pastes and users is stored in a database. Cache servers and a load balancer can be configured to improve the performance and scalability of the system.

[Figure: High Level Design diagram]

Tags: system design, pastebin
Comments: 12

yi.wang.2005
Last Edit: October 1, 2018 8:43 PM

I don't understand why there need to be two tables in the database. It seems that a key-value store is sufficient: the key is the hash of the text, and the value contains the text, the upload time, and the expiration period.

What I have in mind is a web server that accepts the uploaded text and saves it into the above key-value store. We also need a daemon that scans over the storage pass after pass. For each item, if it is expired (current_time >= upload_time + expiration_period), the daemon removes the item; otherwise, the daemon creates a timer to wake up at the expiration time to remove the item.

package main

import (
    "time"
)

// `storage` is an assumed external key-value store package. Note that
// actionList is touched from multiple goroutines, so a production
// version would guard it with a mutex.
type TimeoutHandler struct {
    actionList map[string]int
    nextPass   chan int
}

func NewTimeoutHandler() *TimeoutHandler {
    return &TimeoutHandler{
        actionList: make(map[string]int),
        nextPass:   make(chan int, 100),
    }
}

func (th *TimeoutHandler) RemoveItemWhenExpired(key string, value storage.Value) {
    th.actionList[key] = 1
    t := time.NewTimer(time.Until(value.ExpireTime))

    go func(key string) {
        <-t.C
        storage.Delete(key)
        delete(th.actionList, key)
        if len(th.actionList) == 0 {
            th.nextPass <- 1 // trigger the scan of the next pass
        }
    }(key)
}

func (th *TimeoutHandler) Run() {
    for {
        for s := storage.NewScanner(); !s.Done(); s.Next() {
            th.RemoveItemWhenExpired(s.Key(), s.Value())
        }
        <-th.nextPass
    }
}

func main() {
    th := NewTimeoutHandler()
    th.Run()
}

flyweight2017
Last Edit: October 20, 2018 7:44 AM

@varyani_dinesh

What about additional functional/scalability requirements, such as:

Uniqueness and scalability of URL generation - how do we guarantee uniqueness at massive scale, say, to accommodate thousands (or more) of simultaneous paste requests per second? A single server generating URLs would most likely not be sufficient. But if we distribute generation across multiple servers, what scheme should we use to ensure URL uniqueness?
Storage of the content - interviewers are (or should be) more interested in how candidates would design a scalable solution for content storage than in just using an existing relational/NoSQL/object database.
Content retrieval - in a distributed design, how do we route content retrieval based on a URL?
For any solution of a distributed nature, how do we scale up as more users join? How do we handle compute/storage server failures in those solutions?
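
One common answer to the uniqueness question can be sketched as follows (names and bit widths are assumptions): give each server a small unique machine ID at deployment, combine it with a local counter, and base62-encode the result, so no coordination is needed at generation time. This is essentially the Snowflake-style approach:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const alphabet = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

// encodeBase62 converts a non-negative integer to a base-62 string,
// the usual way short URL keys are produced.
func encodeBase62(n uint64) string {
	if n == 0 {
		return "0"
	}
	var buf []byte
	for n > 0 {
		buf = append([]byte{alphabet[n%62]}, buf...)
		n /= 62
	}
	return string(buf)
}

// KeyGen hands out IDs that are unique across servers without any
// coordination at generation time: the low bits carry a per-server
// machine ID assigned once at deployment, the high bits a local counter.
type KeyGen struct {
	machineID uint64 // unique per server, e.g. 0..1023
	counter   uint64
}

// Next returns the next globally unique key for this server.
func (g *KeyGen) Next() string {
	c := atomic.AddUint64(&g.counter, 1)
	return encodeBase62(c<<10 | g.machineID)
}

func main() {
	a := &KeyGen{machineID: 1}
	b := &KeyGen{machineID: 2}
	fmt.Println(a.Next(), b.Next()) // distinct keys from distinct servers
}
```

A key-generation service that pre-allocates disjoint ID ranges to each server is another common scheme.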

deex
Last Edit: October 1, 2018 7:14 PM

I think they meant for you to design this system with AWS.
Basically, it's API Gateway -> Lambda -> DynamoDB.
The non-functional requirements can easily be achieved with this design, and you don't need to worry about how AWS achieves them.

alitaleg
Last Edit: September 9, 2018 11:43 PM

I think another interesting question could be how to handle the case where multiple users paste the same text. In this case, can we avoid re-storing the same text? Hashing the text as @yi.wang.2005 suggests could be a good idea, perhaps with SHA-256.

Of course, we now need to handle the case where multiple users (or even the same user) paste the same text but want different expiries. We can use two tables: the first maps SHA-256 hash -> document, and the second maps URL -> {document hash, expiry}.

cyrois3@gmail.com
September 14, 2018 12:06 AM

There is a non-functional requirement that has not been addressed: "System should be highly reliable i.e no data loss."

We need something like a RAID 1 (data redundancy) setup. A Cassandra ring will come in handy here.

There should be multiple data stores (maybe that's what the load balancers are for) that replicate data across different data centres.

Different nodes on the ring store chunks of the data, which mitigates hardware failure issues: a single corrupt hard drive will only corrupt part of the data. Partially corrupt data can be recovered from another ring or, in the worst case, from a hash value.

Suggested Video: https://www.youtube.com/watch?v=rHJ1yKN00P4

vinaymaddi
September 9, 2018 9:12 PM

How can we have a load balancer between the various caches and the DB?

blazespinnaker
September 9, 2018 8:58 PM

The issue is primarily one of low-cost object retrieval. Content creation is a side problem that can be done in a myriad of ways, depending on the road map. The key is HA caching as much as possible between the user and the backend: CloudFront, Akamai, Squid, etc. These are the key parts here.

Yunwei_Qiu
June 11, 2018 5:42 PM

I think the application servers need to be separated into two: 1) an Upload Service and 2) a Retrieve Service.
Upload: generate a unique URL, send it along with the text to the DB, send the URL back to the user, and write the URL and text to a global cache (a cache that sits between clients and the service).

Retrieve: when a URL is received from a user, first look it up in the global cache. If it is there, just return it; if not, look it up in the DB by the URL key and return the paste to the user.
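
The retrieve path described above is a read-through cache; a minimal sketch, with the type and field names being assumptions:

```go
package main

import "fmt"

// Store is whatever the retrieve service ultimately reads from (the DB).
type Store interface {
	Get(url string) (string, bool)
}

// ReadThroughCache checks the global cache first and falls back to the
// DB on a miss, filling the cache for the next reader.
type ReadThroughCache struct {
	cache        map[string]string
	db           Store
	Hits, Misses int
}

func NewReadThroughCache(db Store) *ReadThroughCache {
	return &ReadThroughCache{cache: map[string]string{}, db: db}
}

func (c *ReadThroughCache) Get(url string) (string, bool) {
	if text, ok := c.cache[url]; ok {
		c.Hits++
		return text, true
	}
	c.Misses++
	text, ok := c.db.Get(url)
	if ok {
		c.cache[url] = text // fill the cache on the way back
	}
	return text, ok
}

// mapStore is a trivial stand-in DB for this sketch.
type mapStore map[string]string

func (m mapStore) Get(url string) (string, bool) { t, ok := m[url]; return t, ok }

func main() {
	c := NewReadThroughCache(mapStore{"abc": "hello"})
	c.Get("abc") // miss: served from the DB, then cached
	c.Get("abc") // hit: served from the cache
	fmt.Println(c.Hits, c.Misses) // prints 1 1
}
```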

rajaramanathan
Last Edit: April 12, 2019 2:11 AM

Interesting that no one has discussed the potential use of S3 as a solution. People normally view S3 as file storage, but it is actually cloud object storage and can be used for pastes.

Pastes as S3 objects
Generate a UUID as the key for the object.
Content as the object data.
Save the object in the bucket with some default expiry, and grant it a public-read ACL so anyone with the URL can access it (S3 objects are private by default).
Return the URL to access the object.
All of the above operations can be done on the client consuming the S3 service via the SDK.

User
Manage users with the Cognito user pool service.
With regard to the non-functional criteria:

S3 has high durability.
High availability for both the user pool and S3.
Low latency within a region.

hhmd
April 10, 2019 1:38 PM

Is a paste unique to a user?
What is the max number of pastes stored by a single user?
How do we distinguish two different pastes from a given user?

This problem can be solved by writing the paste to a document-oriented sharded database like MongoDB.
The shard key will be the UserId. The URL will be the string Hash(UserId + PasteId).

TTL is implemented with a two-pronged approach: on the response path, check whether the paste is expired; if so, drop the response and return 404. For proactive cleanup, run a lazy background job that searches for expired documents. An index on the expiry time gives faster access to the entries that need cleanup.

Copyright © 2020 LeetCode