· 7 years ago · Oct 05, 2018, 11:54 PM
DynamoDB

Overview
  -NoSQL, key-value/document-oriented database
  -Serverless (no need to worry about which machine you're working with)
  -Scales massively (part of Amazon Web Services)
  -Notable DynamoDB users: Airbnb, Lyft, Duolingo, Netflix, IMDB
  -You specify (pay for) how many read/write requests your database should process
  -Extremely low latency (<10ms typically, <1ms if you enable caching)
  -Simple API (fewer than 20 methods, most not related to writing/reading data)
  -Can integrate with other AWS services (CloudSearch, EMR [managing clusters], Data Pipeline [backups])
  -DynamoDB allocates throughput to 10GB partitions (aka "nodes")
    *Each write capacity unit gives 1KB/s of write throughput
    *Each read capacity unit gives 4KB/s of read throughput
  -You can programmatically change provisioning on-the-fly
  -From a pricing perspective, it's great for (many users + little data)
  -From a performance perspective, it's great for extremely fast responsiveness whenever you're retrieving a document

Cons
  -It's NoSQL
    *Devs probably have less experience building an efficient data model vs. a relational DB
    *Complex queries/scans/joins are tricky and/or bad practice
  -Need to pay per request (high request #s, especially with a large dataset size, are expensive)
  -Need to pay continually (at enterprise-level, only exists in the cloud from Amazon)
  -"Hot keys" (keys hit disproportionately often) are a problem
    *Request provisioning happens on the table level, not the request level
    *Tables are split into partitions
    *The total # of requests is split among partitions
    *If you have a key in one partition that is hit super often, it will exceed the requests allotted to that partition and will error out
      ** This can easily happen if a user is frequently interacting with the same piece of data and has to hit the DB repeatedly
    *Solutions are to either:
      1. Increase overall provisioning (waste of $ since other partitions are fine)
      2. Accept the "throughput exceeded" errors
      3. Figure out how to decrease access to the hot key
      (Some companies log whenever a key is exceeded and then deal with it later)
  -Low visibility into database/partition utilization and performance. May need to contact AWS support to get more detailed info
  -You can get the same effects cheaper with Apache Cassandra if you want to host it yourself
  -You can only downscale 4 times per 24-hour period, which makes dynamic scaling hard to pull off when you want to cut costs
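
The hot-key arithmetic above can be sketched in a few lines. This is a simplified model that assumes the table's provisioning is split evenly among partitions and ignores burst and adaptive capacity; the function names, partition names, and numbers are made up for illustration.

```python
def per_partition_capacity(table_capacity, partitions):
    """Throughput each partition gets if table provisioning is split evenly."""
    return table_capacity / partitions


def throttled_requests(requests_per_partition, table_capacity):
    """Return {partition: requests above its share} for overloaded partitions."""
    share = per_partition_capacity(table_capacity, len(requests_per_partition))
    return {
        name: load - share
        for name, load in requests_per_partition.items()
        if load > share
    }


# Hypothetical table: 1,000 read units spread over 4 partitions (250 each).
load = {"p0": 100, "p1": 400, "p2": 50, "p3": 50}
print(sum(load.values()))              # total demand is well under 1,000
print(throttled_requests(load, 1000))  # yet p1 is over its 250-unit share
```

Total demand is only 600 of the 1,000 provisioned units, but the partition holding the hot key still runs out, which is why raising overall provisioning is the wasteful fix.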

Data Model:
  -Data is stored in "tables"
  -When you create a table, you decide on the type of key...
    *Simple key: a single attribute in the table. This is also the "partition key". The only efficient operation is store/retrieve by key
    *Composite key: specify two attributes. One is the "partition key" and one is the "sort key". Can query/sort/filter with the sort key.
      (partition key, sort key) should be unique for each item.
  -Items in each table are split into "partitions"
    *A partition can be up to 10GB
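
A rough in-memory picture of a composite-key table may help here. This is not DynamoDB code; the `Table` class, its method names, and the sample attributes are hypothetical stand-ins that only mimic the uniqueness rule for (partition key, sort key).

```python
class Table:
    """In-memory stand-in for a table with a composite (partition, sort) key."""

    def __init__(self, partition_key, sort_key):
        self.partition_key = partition_key
        self.sort_key = sort_key
        self.partitions = {}  # partition key value -> {sort key value: item}

    def put_item(self, item):
        pk, sk = item[self.partition_key], item[self.sort_key]
        # A second put with the same (pk, sk) pair replaces the earlier item.
        self.partitions.setdefault(pk, {})[sk] = dict(item)

    def get_item(self, pk, sk):
        return self.partitions.get(pk, {}).get(sk)

    def query(self, pk):
        """All items under one partition key, ordered by sort key."""
        items = self.partitions.get(pk, {})
        return [items[sk] for sk in sorted(items)]


orders = Table("customerId", "orderDate")
orders.put_item({"customerId": "c1", "orderDate": "2018-03-01", "total": 20})
orders.put_item({"customerId": "c1", "orderDate": "2018-01-15", "total": 35})
orders.put_item({"customerId": "c1", "orderDate": "2018-03-01", "total": 99})
print(orders.query("c1"))  # two items: the repeated key was overwritten
```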

Complex data:
  -DynamoDB attributes support numbers, strings, binary values but also:
    *Nested objects
    *Sets (numbers, strings, binary values)
    *Lists (untyped)

Consistency:
  -"Consistency ensures that a transaction can only bring the database from one valid state to another, maintaining database invariants: any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof."
  -You can select the consistency level when you perform operations
  -DynamoDB stores three copies of each item. When you write data to DynamoDB, it only acknowledges the write after two copies out of three were updated. The third copy is updated later.
  -When you read data from DynamoDB, you have two options:
    *You can use strong consistency, in which case DynamoDB will read data from two copies and return the latest data, or
    *you can select eventual consistency, in which case DynamoDB will only read data from one copy at random, and may return stale data.
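
The two-of-three behaviour described above can be played with in a toy simulation. This models only the description in these notes, not DynamoDB's actual internals; `ReplicatedItem` and its methods are invented for the sketch.

```python
import random


class ReplicatedItem:
    """Toy model of one item stored as three copies of (version, value)."""

    def __init__(self, value):
        self.copies = [(0, value)] * 3

    def write(self, value):
        """Update two of the three copies; the third is left for later."""
        version = max(v for v, _ in self.copies) + 1
        for i in random.sample(range(3), 2):
            self.copies[i] = (version, value)

    def replicate(self):
        """Bring every copy up to date (the 'updated later' step)."""
        latest = max(self.copies, key=lambda c: c[0])
        self.copies = [latest] * 3

    def read(self, strong=False):
        if strong:
            # Any two copies include at least one that acknowledged the write.
            pair = random.sample(self.copies, 2)
            return max(pair, key=lambda c: c[0])[1]
        return random.choice(self.copies)[1]  # may be the stale copy


item = ReplicatedItem("old")
item.write("new")
print(item.read(strong=True))  # always the latest value
print(item.read())             # either value until replicate() runs
```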

Indexes:
  -Only two types of indexes:
    * Local secondary index (LSI):
      - like a composite key (same partition key, another attribute as the sort key)
      - use this if you need to sort or filter on another attribute
      - Pairs of (partition key, index sort key) do NOT need to be unique
    * Global secondary index (GSI):
      - Can use a different partition key for the data
      - Use this when you want to fetch by one of two ids
      - Can be simple or composite keys
      - Internally, this creates a copy of the data in another table with a different key. When data is written, DynamoDB copies it to the other table. The copy is eventually consistent.
  -You can have up to five local secondary indexes and five global secondary indexes *per table*
  -Indexes are capped to 20 user-specified attributes

Supported queries:
  -CRUD a key
  -(Find by | Sort by | Find in range) sort field **for one partition**
  -Scans:
    *Allow you to search for something across partitions (scans a specified table)
    *Inefficient, try to avoid
  -Sort keys enable operations:
    == < > >= <=
    "begins with"
    "between"
    "contains" and "in" (as filter expressions only, not as key conditions)
    sorted results
    counts
    (top||bottom) N values
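
To see why these operations are cheap within one partition, here is a small sketch over a plain sorted list of sort-key values. It assumes the keys are already kept in sorted order; the helper names are illustrative and this is not how DynamoDB is implemented internally.

```python
from bisect import bisect_left, bisect_right


def between(sorted_keys, low, high):
    """Keys k with low <= k <= high, located by binary search."""
    return sorted_keys[bisect_left(sorted_keys, low):bisect_right(sorted_keys, high)]


def begins_with(sorted_keys, prefix):
    """Keys starting with prefix; they form one contiguous run when sorted."""
    start = bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


def top_n(sorted_keys, n, descending=False):
    """First n keys in ascending order, or the last n in descending order."""
    return sorted_keys[::-1][:n] if descending else sorted_keys[:n]


dates = ["2018-01-05", "2018-01-20", "2018-02-11", "2018-03-02"]
print(between(dates, "2018-01-10", "2018-02-28"))
print(begins_with(dates, "2018-01"))
print(top_n(dates, 2, descending=True))
```

Every result is a contiguous slice of the sorted keys, which is what makes range-style conditions efficient; "contains" has no such property and needs every item to be examined.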

API:
  GetItem => get a single item by id
  BatchGetItem => get several items by id
  Query => query a composite key or index
  Scan => scan through a table

  PutItem => write new item
  BatchWriteItem => write multiple items
  UpdateItem => update some fields in specific item
  DeleteItem => remove item by id

  **NO METHODS WORK ACROSS DIFFERENT TABLES**

DynamoDB Mapper
  -ORM library like Hibernate
  -recommended way of doing work, cuts down on boilerplate code

Partitions
  -A single partition can hold approximately 10 GB of data, and can support a maximum of 3,000 read capacity units or 1,000 write capacity units.
  -One read capacity unit = one strongly consistent read per second, or two eventually consistent reads per second, for items up to 4 KB in size.
  -One write capacity unit = one write per second, for items up to 1 KB in size.
  -Read/write units are rounded up to the nearest (4 or 1) KB unit.
  -# of partitions formula: ( readCapacityUnits / 3,000 ) + ( writeCapacityUnits / 1,000 ) = initialPartitions (rounded up)
  -(Note all partitions are invisibly tripled for redundancy)
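
The rounding rules and the partition formula above translate directly into arithmetic. The sketch below assumes 1 KB = 1,024 bytes and rounds the halved eventually consistent total up to a whole unit; both choices, and the function names, are assumptions made for illustration.

```python
import math


def read_capacity_units(item_bytes, reads_per_second, strongly_consistent=True):
    """Read units needed: item size rounded up to 4 KB blocks, times the rate."""
    units_per_read = math.ceil(item_bytes / 4096)
    total = units_per_read * reads_per_second
    # Eventually consistent reads cost half as much.
    return total if strongly_consistent else math.ceil(total / 2)


def write_capacity_units(item_bytes, writes_per_second):
    """Write units needed: item size rounded up to 1 KB blocks, times the rate."""
    return math.ceil(item_bytes / 1024) * writes_per_second


def initial_partitions(rcu, wcu):
    """Partition count from the formula in the notes, rounded up."""
    return math.ceil(rcu / 3000 + wcu / 1000)


print(read_capacity_units(6000, 100))         # 6,000-byte items use 2 units each
print(read_capacity_units(6000, 100, False))  # half that when eventually consistent
print(write_capacity_units(1500, 50))         # 1,500-byte items use 2 units each
print(initial_partitions(5000, 2000))
```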

Burst Capacity
  DynamoDB provides some flexibility in your per-partition throughput provisioning by providing burst capacity, as follows. Whenever you are not fully using a partition's throughput, DynamoDB reserves a portion of that unused capacity for later bursts of throughput to handle usage spikes.

  DynamoDB currently retains up to five minutes (300 seconds) of unused read and write capacity. During an occasional burst of read or write activity, these extra capacity units can be consumed quickly, even faster than the per-second provisioned throughput capacity that you've defined for your table.

  (this is auto-enabled)
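
A toy bucket model shows how the 300-second retention plays out. It is a simplification that banks all unused capacity, whereas the text above says only a portion is reserved; `BurstBucket` and the figures used are hypothetical.

```python
class BurstBucket:
    """Toy model: unused per-second capacity is banked, up to 300 seconds' worth."""

    def __init__(self, provisioned_per_second, retention_seconds=300):
        self.provisioned = provisioned_per_second
        self.max_banked = provisioned_per_second * retention_seconds
        self.banked = 0

    def tick(self, requested):
        """Process one second of traffic; return how many units were throttled."""
        if requested <= self.provisioned:
            unused = self.provisioned - requested
            self.banked = min(self.max_banked, self.banked + unused)
            return 0
        overflow = requested - self.provisioned
        from_bank = min(overflow, self.banked)
        self.banked -= from_bank
        return overflow - from_bank


bucket = BurstBucket(provisioned_per_second=10)
for _ in range(60):          # one idle minute banks 600 units
    bucket.tick(0)
print(bucket.tick(500))      # spike absorbed by banked capacity
print(bucket.tick(500))      # bank nearly empty, so most of this is throttled
```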

Adaptive Capacity
  To better accommodate uneven access patterns, DynamoDB adaptive capacity enables your application to continue reading and writing to hot partitions without being throttled, provided that traffic does not exceed your table's total provisioned capacity or the partition maximum capacity. Adaptive capacity works by automatically increasing throughput capacity for partitions that receive more traffic.

  (takes 5-30 minutes before this feature will kick in)
  (this is auto-enabled)
  (this appears to be VERY recent, blog post on this was 8/13/2018)

Limits
  -In USA, max configurable limits without asking support for more:
    Per table: 40,000 read capacity units and 40,000 write capacity units (160 MB/s read, 40 MB/s write)
    Per account: 80,000 read capacity units and 80,000 write capacity units (320 MB/s read, 80 MB/s write)
  -Partition keys must be 1-2048 bytes
  -Sort keys must be 1-1024 bytes
  -Strings are always UTF-8 and must be <= 400KB
  -Numbers have 38 digits of precision
  -Binary data must be <= 400KB
  -Max item size is 400KB (the sum of the sizes of all of its attributes)
  -Attribute values cannot be an empty String or an empty Set. Empty Lists/Maps are okay.
  -Attributes can be nested 32 levels deep.
  -You cannot have more than 10 (Create||Update||Delete)Table requests running simultaneously.
  -BatchGetItem cannot get more than 100 items, and the total size cannot exceed 16MB.
  -BatchWriteItem cannot exceed 25 Put/Delete requests. Total size cannot exceed 16MB.
  -Result set from a (Query||Scan) is limited to 1MB. "You can use the LastEvaluatedKey from the query/scan response to retrieve more results."
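
The LastEvaluatedKey loop mentioned in the last limit could look roughly like this. `fetch_all` and `fake_query` are hypothetical helpers; the sketch assumes the page-fetching callable accepts an `ExclusiveStartKey` argument and returns a dict with `Items` and, when more data remains, `LastEvaluatedKey`, which is the shape a Query/Scan response takes.

```python
def fetch_all(query_page, **params):
    """Collect items from every page of a Query/Scan-style call."""
    items = []
    start_key = None
    while True:
        if start_key is not None:
            params["ExclusiveStartKey"] = start_key
        page = query_page(**params)
        items.extend(page.get("Items", []))
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:  # no key means this was the last page
            return items


def fake_query(ExclusiveStartKey=None, page_size=2):
    """Stand-in for a real query call: serves the numbers 0..4 in small pages."""
    data = list(range(5))
    start = 0 if ExclusiveStartKey is None else ExclusiveStartKey + 1
    page = data[start:start + page_size]
    result = {"Items": page}
    if start + page_size < len(data):
        result["LastEvaluatedKey"] = page[-1]
    return result


print(fetch_all(fake_query))  # all five items, gathered across three pages
```

With a real client, the same loop would be handed the SDK's query or scan method in place of `fake_query`.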

Advanced features
  -Optimistic locking
    *What if two processes try to update the same record?
    *Recommended to use a technique similar to compare-and-swap (https://en.wikipedia.org/wiki/Compare-and-swap)
      **Items carry a "version" field
      **When you want to do an update:
        1. read the item and its current version
        2. do some processing
        3. increment the version # of your local copy and then try to write it
        Dynamo compares the version that you gave vs. the current version.
        If you are exactly +1 higher:
          * do the update
        Else
          * if they are different, start from #1 again
    * Why enable it?
      1. guarantees you won't have two updates "write over each other"
    * Why disable it?
      1. you have to pay for multiple requests (read+write best case, read+write x2 if it fails once...)
      2. takes longer (have to check current state before writing to it)
      3. you have a way to ensure only one process would ever update that key
      4. UPDATE_SKIP_NULL_ATTRIBUTES is another way to deal with the overwriting problem (see below)
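
The read / process / conditional-write loop above can be sketched against an in-memory store. `Store`, `VersionConflict`, and `update_with_retry` are hypothetical names; a real implementation would issue a conditional write to DynamoDB instead of checking a dict.

```python
class VersionConflict(Exception):
    """Raised when the stored version is not the one the writer read."""


class Store:
    def __init__(self):
        self.items = {}

    def get(self, key):
        return dict(self.items[key])

    def put_if_version(self, key, item, expected_version):
        """Write only if nobody changed the item since it was read."""
        if self.items[key]["version"] != expected_version:
            raise VersionConflict(key)
        self.items[key] = dict(item)


def update_with_retry(store, key, change, max_attempts=5):
    for _ in range(max_attempts):
        item = store.get(key)               # 1. read the item and its version
        read_version = item["version"]
        change(item)                        # 2. do some processing
        item["version"] = read_version + 1  # 3. bump the local version
        try:
            store.put_if_version(key, item, read_version)
            return item
        except VersionConflict:
            continue                        # someone else won; start over
    raise VersionConflict(key)


store = Store()
store.items["a"] = {"version": 1, "count": 0}


def add_one(item):
    item["count"] += 1


print(update_with_retry(store, "a", add_one))
```

Each failed attempt costs another read and another write, which is the pricing drawback listed under "Why disable it?".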

  -UPDATE_SKIP_NULL_ATTRIBUTES
    * Will only update attributes you specified that are not null
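
The effect of this save behavior can be pictured as a merge that ignores nulls. This is only an illustration of the semantics described above using plain dicts, not the mapper's own code; the function name and sample item are invented.

```python
def update_skip_null_attributes(stored, update):
    """Return the stored item with only the non-null attributes of update applied."""
    merged = dict(stored)
    merged.update({k: v for k, v in update.items() if v is not None})
    return merged


stored = {"id": 1, "name": "Ann", "email": "ann@example.com"}
update = {"id": 1, "name": None, "email": "ann@example.org"}
print(update_skip_null_attributes(stored, update))  # name survives, email changes
```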

  -Transactions
    * DynamoDB does not support them out of the box
    * There are extension libraries through AWS that can do it (maybe only in Java?)
      ** Implemented by storing operations to a table and then committing all the stored operations

  -Time to Live
    * Can specify a timeframe attribute which automatically deletes an item or moves it to S3 (free feature)

  -Streams
    * Must be enabled
    * Provides reading of an immutable, ordered stream of updates on a table
    * Best when you need to react to a change to a table, e.g. for replication/syncing/aggregation

  -Caching (DynamoDB Accelerator, DAX)
    * Must be enabled to use
    * Tricky to maintain cache consistency
    * DAX has the same API as DynamoDB
    * Stores data that was written to DynamoDB
    * Provides sub-millisecond latency

Best practices
  * Distribute keys! Hot keys will mess you up real bad
  * Feed DynamoDB from an asynchronous queue. Requeue on throughput exceptions.
  * If you don't need to return data "live", read asynchronously as well.
  * Read all of the "Limits" document!
  * Avoid complex queries!
  * Caching can be very helpful and/or very painful
  * Remember there's a 10GB limit per key!
  * "The data in DynamoDB is not structured to populate a dashboard nor is it structured to work well for more complicated analytics. For these tasks, we chose two different technologies: Elasticsearch and Google BigQuery."
  * You may want to partition "upstream" of DynamoDB to have better control and reduce costs
  * Batch writes whenever possible!
  * Evenly spacing requests is best
  * Amazon suggests turning on DynamoDB Auto Scaling
  * AWS SDK has automatic retry functionality (don't need to write it yourself)
  * If you enable "eventually consistent reads" you will consume only half as much read throughput
  * Amz: "You should design your application for uniform activity across all logical partition keys in the Table and its secondary indexes"
  * Without transactions, you cannot write to multiple tables atomically. Careful of orphan data, especially if the network fails!
  * Best practices for X:Y relationships...
    1:1 => normal key/value. Partition key can be the unique key.
    1:many => table or global secondary index (GSI). Partition key is the unique key, sort key is the field you query on. e.g. personkey, birthday
    many:many => use a table AND a GSI with partition and sort keys switched. e.g. (personkey, birthday) and (birthday, personkey)
  * Hierarchical data options:
    1. use a composite sort key to define a hierarchy (albumId, albumId:trackId)
    2. store as a JSON document (capped to 400KB item size)
  * For events (eventId, timestamp), use one table per time period
    -precreate tables (on a daily||weekly||monthly schedule depending on needs)
    -only need provisioning on the current table
    -reduce or turn off throughput for old tables
  * For a product catalog (many requests for a small number of items)
    -cache popular items in your application! (select id, description from ProductCatalog where id='popularProduct')
  * For messaging apps (many:many)
    -Biz case: you want to sort/filter on many different attributes (date, sender, recipient...)
    -Partition on (recipient, date)
    -Separate bulk data into separate tables (e.g. sender/date/msgId instead of sender/date/message)
    -Turn attributes you want to filter on into GSI "projected attributes" (e.g. recipient/date, with msgId/recipient/date as projected attributes)
  * For write-heavy apps (e.g. tracking voting)
    -separate the same concept into multiple tables (e.g. clinton_1, clinton_2)
    -aggregate data in a separate table when necessary (e.g. clinton_all)
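
The write-heavy pattern in the last bullet can be sketched as a sharded counter. The class below uses an in-memory `Counter` as a stand-in for the real tables; `ShardedCounter`, the shard count, and the random shard choice are assumptions made for illustration.

```python
import random
from collections import Counter


class ShardedCounter:
    """Spread writes for one logical counter over several shard keys."""

    def __init__(self, name, shards):
        self.name = name
        self.keys = [f"{name}_{i}" for i in range(1, shards + 1)]
        self.table = Counter()  # stand-in for the per-shard tables

    def increment(self, amount=1):
        # Each write lands on a random shard, so no single key becomes hot.
        self.table[random.choice(self.keys)] += amount

    def aggregate(self):
        """Sum the shards into the combined '<name>_all' entry and return it."""
        total = sum(self.table[k] for k in self.keys)
        self.table[f"{self.name}_all"] = total
        return total


votes = ShardedCounter("clinton", shards=2)
for _ in range(1000):
    votes.increment()
print(votes.keys)         # the two shard names
print(votes.aggregate())  # combined total across shards
```

Reads become more expensive because every shard has to be consulted, which is why the aggregate is computed only when necessary.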

Sources:
  Pros/cons/opinions
    https://read.acloud.guru/why-amazon-dynamodb-isnt-for-everyone-and-how-to-decide-when-it-s-for-you-aefc52ea9476
    https://syslog.ravelin.com/you-probably-shouldnt-use-dynamodb-89143c1287ca
    https://stackoverflow.com/questions/49055206/dynamodb-save-api-optimistic-locking-and-savebehavior
    https://medium.com/@davidmytton/aws-vs-google-cloud-flexibility-vs-operational-simplicity-dca4324b03d4
    https://syslog.ravelin.com/scaling-a-startup-using-dynamodb-4d97b0843350
    https://medium.com/@kevrone/a-combination-of-costs-and-lack-of-a-relational-model-and-our-choice-to-not-replicate-a-backup-due-224f676cdaea
    https://segment.com/blog/the-million-dollar-eng-problem/
    https://www.dailycred.com/article/dynamodb-shortcomings-and-work-arounds
    https://www.slideshare.net/AmazonWebServices/design-patterns-using-amazon-dynamodb (released by AWS, super useful best practices)

  Overview/Facts
    https://brewing.codes/2017/11/06/dynamodb-overview/ (best overview of Dynamo features)
    https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBMapper.OptimisticLocking.html
    https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/globaltables_reqs_bestpractices.html
    https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AutoScaling.html
    https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ProvisionedThroughput.html
    https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-design.html
    https://aws.amazon.com/blogs/database/how-amazon-dynamodb-adaptive-capacity-accommodates-uneven-data-access-patterns-or-why-what-you-know-about-dynamodb-might-be-outdated/ (this link is actually critical to understanding the new DynamoDB changes)
    https://blog.codeship.com/partitioning-behavior-of-dynamodb/ (when and how partitions are created)