Oct 05, 2018, 11:54 PM
DynamoDB

Overview
    -NoSQL, key-value/document-oriented database
    -Serverless (no need to worry about which machine you're working with)
    -Scales massively (part of Amazon Web Services)
    -Notable DynamoDB users: Airbnb, Lyft, Duolingo, Netflix, IMDB
    -You specify (pay for) how many read/write requests per second your database should process
    -Extremely low latency (<10ms typically, <1ms if you enable caching)
    -Simple API (fewer than 20 methods, most not related to writing/reading data)
    -Can integrate with other AWS services (CloudSearch, EMR [managing clusters], Data Pipeline [backups])
    -DynamoDB allocates throughput to 10GB partitions (aka "nodes")
        *Each write capacity unit gives 1KB/s of write throughput
        *Each read capacity unit gives 4KB/s of read throughput
    -You can programmatically change provisioning on-the-fly
    -From a pricing perspective, it's great for (many users + little data)
    -From a performance perspective, it's great for extremely fast responsiveness whenever you're retrieving a document

Cons
    -It's NoSQL
        *Devs probably have less experience building an efficient data model vs. relational DB
        *Complex queries/scans/joins are tricky and/or bad practice
    -Need to pay per request (high request #s, especially with a large dataset size, are expensive)
    -Need to pay continually (at enterprise-level, only exists in the cloud from Amazon)
    -"Hot keys" (keys hit disproportionately often) are a problem
        *Throughput is provisioned at the table level, not per partition or per key
        *Tables are split into partitions
        *The total # of requests is split among partitions
        *If you have a key in one partition that is hit super often, it will exceed the request allowance for that partition and requests will error out
            ** This can easily happen if a user is frequently interacting with the same piece of data and has to hit the DB repeatedly
        *Solutions are to either:
            1. Increase overall provisioning (waste of $ since other partitions are fine)
            2. Accept the "throughput exceeded" errors
            3. Figure out how to decrease access to the hot key
                (Some companies log whenever a key's throughput is exceeded and then deal with it later)
    -Low visibility into database/partition utilization and performance. May need to contact AWS support to get more detailed info
    -You can get the same effects cheaper with Apache Cassandra if you want to host it yourself
    -You can only downscale 4 times per 24-hour period, which makes dynamic scaling hard to pull off when you want to cut costs

Data Model:
    -Data is stored in "tables"
    -When you create a table, you decide on the type of key...
        *Simple key: an attribute in the table. This is also the "partition key". Only efficient operation is store/retrieve by key
        *Composite key: specify two attributes. One is "partition key" and one is "sort key". Can query/sort/filter with the sort key.
                (partition key, sort key) should be unique for each item.
    -Items in each table are split into "partitions"
        *A partition can be up to 10GB

Complex data:
    -DynamoDB attributes support numbers, strings, binary values but also:
        *Nested objects
        *Sets (numbers, strings, binary values)
        *Lists (untyped)

Consistency:
    -"Consistency ensures that a transaction can only bring the database from one valid state to another, maintaining database invariants: any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof."
    -You can select the consistency level when you perform read operations
    -DynamoDB stores three copies of each item. When you write data, DynamoDB only acknowledges the write after two copies out of three were updated. The third copy is updated later.
    -When you read data from DynamoDB, you have two options:
        *You can use strong consistency, in which case DynamoDB will read data from two copies and return the latest data, or
        *you can select eventual consistency, in which case DynamoDB will only read data from one copy at random, and may return stale data.

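The read/write behavior described above can be mimicked with a toy model. This is only a sketch of the description in these notes (three copies, two updated before the write is acknowledged), not DynamoDB's real internals; the class and method names are hypothetical.

```python
import random


class ReplicatedItem:
    """Toy model of one item stored as three copies."""

    def __init__(self, value):
        # Each copy is a (version, value) pair; all three start in sync.
        self.copies = [(0, value)] * 3

    def write(self, value):
        """Update two copies and acknowledge; the third copy stays stale for now."""
        version = max(v for v, _ in self.copies) + 1
        self.copies[0] = (version, value)
        self.copies[1] = (version, value)

    def replicate(self):
        """Background catch-up: bring every copy to the newest version."""
        newest = max(self.copies, key=lambda c: c[0])
        self.copies = [newest] * 3

    def read(self, strongly_consistent=False):
        if strongly_consistent:
            # Read two copies; a write touched two of three, so at least one is fresh.
            two = random.sample(self.copies, 2)
            return max(two, key=lambda c: c[0])[1]
        # Eventually consistent: one copy at random, possibly the stale one.
        return random.choice(self.copies)[1]


item = ReplicatedItem("old")
item.write("new")
print(item.read(strongly_consistent=True))  # new
item.replicate()
print(item.read())                          # new (all copies have caught up)
```

Between `write` and `replicate`, an eventually consistent `read()` may return either "old" or "new", which is the stale-read window the notes mention.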
Indexes:
    -Only two types of indexes:
        * Local secondary index:
            - like a composite key (same partition key, another attribute as the sort key)
            - do this if you need to sort or filter on another attribute
            - Pairs of (partition key, index sort key) do NOT need to be unique
        * Global secondary index
            - Can use a different partition key for the data
            - Use this when you want to fetch by one of two ids
            - Can be simple or composite keys
            - Internally, this will create a copy of the data into another table with a
              different key. When data is written DynamoDB will copy it to the other table.
              It will reach eventual consistency.
    -You can have up to five local secondary indexes and five global secondary indexes *per table*
    -Indexes are capped at 20 user-specified attributes

Supported queries:
    -CRUD a key
    -(Find by | Sort by | Find in range) sort field **for one partition**
    -Scans:
        *Allow you to search for something across partitions (scans a specified table)
        *Inefficient, try to avoid
    -Sort keys enable operations:
        == < > >= <=
        "begins with"
        "between"
        sorted results
        counts
        (top||bottom) N values
    -Filter expressions (applied after the items are read) additionally allow:
        "contains"
        "in"

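To make the "for one partition" restriction concrete, here is a minimal in-memory sketch of a table with a composite key. `ToyTable` and its methods are hypothetical stand-ins, not the DynamoDB API; the point is that range and prefix lookups are cheap because sort keys are kept ordered inside a single partition.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a table with a composite (partition key, sort key)."""

    def __init__(self):
        self._sort_keys = defaultdict(list)  # partition key -> sorted sort keys
        self._items = {}                     # (pk, sk) -> item

    def put_item(self, pk, sk, item):
        if (pk, sk) not in self._items:
            bisect.insort(self._sort_keys[pk], sk)
        self._items[(pk, sk)] = item  # same (pk, sk) overwrites: the pair is unique

    def query_between(self, pk, low, high, limit=None, descending=False):
        """Items in ONE partition whose sort key is in [low, high], in key order."""
        keys = self._sort_keys.get(pk, [])
        selected = keys[bisect.bisect_left(keys, low):bisect.bisect_right(keys, high)]
        if descending:
            selected = selected[::-1]
        return [self._items[(pk, sk)] for sk in selected[:limit]]

    def query_begins_with(self, pk, prefix):
        """Items in ONE partition whose (string) sort key starts with prefix."""
        keys = self._sort_keys.get(pk, [])
        out = []
        for sk in keys[bisect.bisect_left(keys, prefix):]:
            if not sk.startswith(prefix):
                break
            out.append(self._items[(pk, sk)])
        return out


t = ToyTable()
for day in ("2018-10-01", "2018-10-03", "2018-10-05", "2018-11-02"):
    t.put_item("user#1", day, {"day": day})

print(t.query_between("user#1", "2018-10-02", "2018-10-31"))
# [{'day': '2018-10-03'}, {'day': '2018-10-05'}]
print(t.query_between("user#1", "0", "9", limit=1, descending=True))
# [{'day': '2018-11-02'}]
```

Anything that cannot be phrased as "this partition key plus a condition on the sort key" falls back to touching every item, which is what a Scan does.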
API:
    GetItem => get a single item by id
    BatchGetItem => get several items by id
    Query => query a composite key or index
    Scan => scan through a table

    PutItem => write new item
    BatchWriteItem => write multiple items
    UpdateItem => update some fields in specific item
    DeleteItem => remove item by id

    **NO METHODS WORK ACROSS DIFFERENT TABLES**

DynamoDBMapper
    -ORM-style library (like Hibernate) in the AWS SDK for Java
    -recommended way of doing work, cuts down on boilerplate code

Partitions
    -A single partition can hold approximately 10 GB of data, and can support a maximum of
        3,000 read capacity units or 1,000 write capacity units.
    -One read capacity unit = one strongly consistent read per second, or two eventually consistent reads per second, for items up to
        4 KB in size.
    -One write capacity unit = one write per second, for items up to 1 KB in size.
    -Read/write units are rounded up to nearest (4 or 1) KB unit.
    -# of partitions formula: ( readCapacityUnits / 3,000 ) + ( writeCapacityUnits / 1,000 ) = initialPartitions (rounded up)
    -(Note all partitions are invisibly tripled for redundancy)

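The rounding rules and the partition formula above are easy to get wrong by hand, so here is a small worked sketch. The function names are made up for illustration; the constants (4 KB, 1 KB, 3,000, 1,000) are the ones listed in these notes.

```python
import math


def read_capacity_units(item_size_bytes, reads_per_second, strongly_consistent=True):
    """RCUs needed: item size rounds up to 4 KB blocks; eventual reads cost half."""
    blocks = math.ceil(item_size_bytes / 4096)
    units = blocks * reads_per_second
    return units if strongly_consistent else math.ceil(units / 2)


def write_capacity_units(item_size_bytes, writes_per_second):
    """WCUs needed: item size rounds up to 1 KB blocks."""
    return math.ceil(item_size_bytes / 1024) * writes_per_second


def initial_partitions(rcu, wcu):
    """The partition formula from these notes, rounded up."""
    return math.ceil(rcu / 3000 + wcu / 1000)


# Hypothetical workload: 6 KB items, 100 reads/s and 50 writes/s.
print(read_capacity_units(6 * 1024, 100))                             # 200
print(read_capacity_units(6 * 1024, 100, strongly_consistent=False))  # 100
print(write_capacity_units(6 * 1024, 50))                             # 300
print(initial_partitions(5000, 2000))                                 # 4
```

Note how a 6 KB item costs two read units (it spills into a second 4 KB block) but six write units, which is why large items are disproportionately expensive to write.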
Burst Capacity
    DynamoDB provides some flexibility in your per-partition throughput provisioning by providing burst capacity, as follows. Whenever you are not fully using a partition's throughput, DynamoDB reserves a portion of that unused capacity for later bursts of throughput to handle usage spikes.

    DynamoDB currently retains up to five minutes (300 seconds) of unused read and write capacity. During an occasional burst of read or write activity, these extra capacity units can be consumed quickly, even faster than the per-second provisioned throughput capacity that you've defined for your table.

    (this is auto-enabled)

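One way to picture burst capacity is as a bucket of saved-up units. The sketch below is a simplified model built only from the description above (unused capacity is banked, capped at 300 seconds' worth); the real service gives no guarantees about how much burst is available, and `BurstBucket` is a hypothetical name.

```python
class BurstBucket:
    """Toy model: unused capacity accumulates, capped at 300 seconds' worth."""

    RETENTION_SECONDS = 300

    def __init__(self, provisioned_per_second):
        self.provisioned = provisioned_per_second
        self.saved = 0  # unused capacity units banked for later bursts

    def tick(self, requested):
        """Process one second of traffic; return how many units were throttled."""
        available = self.provisioned + self.saved
        served = min(requested, available)
        # Whatever was not used this second is banked, up to the 300-second cap.
        cap = self.provisioned * self.RETENTION_SECONDS
        self.saved = min(available - served, cap)
        return requested - served


bucket = BurstBucket(provisioned_per_second=10)
for _ in range(60):        # one quiet minute banks 60 * 10 = 600 units
    bucket.tick(0)
print(bucket.tick(500))    # 0   (spike absorbed by the banked units)
print(bucket.tick(500))    # 380 (only 10 provisioned + 110 saved were left)
```

The second spike shows why burst capacity only helps with occasional peaks: once the bank is drained, throughput drops back to the provisioned rate.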
Adaptive Capacity
    To better accommodate uneven access patterns, DynamoDB adaptive capacity enables your application to continue reading and writing to hot partitions without being throttled, provided that traffic does not exceed your table's total provisioned capacity or the partition maximum capacity. Adaptive capacity works by automatically increasing throughput capacity for partitions that receive more traffic.

    (takes 5-30 minutes before this feature will kick in)
    (this is auto-enabled)
    (this appears to be VERY recent, blog post on this was 8/13/2018)

Limits
    -In USA, max configurable limits without asking support for more:
        Per table: 40,000 read capacity units and 40,000 write capacity units   (160 MB/s read, 40 MB/s write)
        Per account: 80,000 read capacity units and 80,000 write capacity units (320 MB/s read, 80 MB/s write)
    -Partition keys must be 1-2048 bytes
    -Sort keys must be 1-1024 bytes
    -Strings are always UTF-8 and must be <= 400KB
    -Numbers have 38 digits of precision
    -Binary data must be <= 400KB
    -Max item size is 400KB (the sum of the sizes of all of its attributes)
    -Attribute values cannot be an empty String or an empty Set. Empty Lists/Maps are okay.
    -Attributes can be nested 32 levels deep.
    -You cannot have more than 10 (Create||Update||Delete)Table requests running simultaneously.
    -BatchGetItem cannot get more than 100 items, and the total size cannot exceed 16MB.
    -BatchWriteItem cannot exceed 25 Put/Delete requests. Total size cannot exceed 16MB.
    -Result set from a (Query||Scan) is limited to 1MB. "You can use the LastEvaluatedKey from the query/scan response to retrieve more results."

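The 1MB result limit means any Query or Scan that might return more has to be paged. Below is a minimal sketch of that loop. `LastEvaluatedKey` and `ExclusiveStartKey` are the real response/request field names; `query_all` and `fake_query_page` are hypothetical helpers, with the fake standing in for a real Query call so the example runs without AWS access.

```python
def query_all(query_page, **params):
    """Collect every item by following LastEvaluatedKey until it is absent."""
    items = []
    while True:
        page = query_page(**params)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return items
        # Resume the next request right after the last item of this page.
        params["ExclusiveStartKey"] = last_key


def fake_query_page(ExclusiveStartKey=None, page_size=2):
    """Hypothetical stand-in for a real Query call, paging over five items."""
    data = [{"id": i} for i in range(5)]
    start = 0 if ExclusiveStartKey is None else ExclusiveStartKey["id"] + 1
    page = {"Items": data[start:start + page_size]}
    if start + page_size < len(data):
        page["LastEvaluatedKey"] = data[start + page_size - 1]
    return page


print(query_all(fake_query_page))
# [{'id': 0}, {'id': 1}, {'id': 2}, {'id': 3}, {'id': 4}]
```

With a real client, the same loop should work by passing the client's query function in place of `fake_query_page`, along with the usual table and key-condition parameters.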
Advanced features
    -Optimistic locking
        *What if two processes try to update the same record?
        *Recommended to use a technique similar to compare-and-swap (https://en.wikipedia.org/wiki/Compare-and-swap)
            **Items carry a "version" field
            **When you want to do an update:
                1. read the item and its current version
                2. do some processing
                3. Increment the version # of your local copy and then try to write it.
                   Dynamo compares the version you read against the version currently stored.
                        If they match (your new version is exactly +1 higher):
                            * do the update
                        Else
                            * someone else updated the item in the meantime; start from #1 again
        * Why enable it?
            1. guarantees you won't have two updates "write over each other"
        * Why disable it?
            1. you have to pay for multiple requests (read+write best case, read+write x2 if it fails once...)
            2. takes longer (have to check current state before writing to it)
            3. you have a way to ensure only one process would ever update that key
            4. UPDATE_SKIP_NULL_ATTRIBUTES is another way to deal with the overwriting problem (see below)

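The read / modify / conditional-write loop above can be sketched without any AWS dependency. `ToyStore`, `VersionConflict`, and `update_with_retry` are hypothetical names; in a real table the check inside `put_if_version` would be a conditional write evaluated by DynamoDB itself, not by client code.

```python
class VersionConflict(Exception):
    pass


class ToyStore:
    """In-memory stand-in for a table whose items carry a "version" field."""

    def __init__(self):
        self.items = {}

    def get(self, key):
        return dict(self.items[key])  # return a copy, like a client-side read

    def put_if_version(self, key, item, expected_version):
        """Conditional write: succeed only if the stored version is unchanged."""
        current = self.items.get(key, {"version": 0})["version"]
        if current != expected_version:
            raise VersionConflict(f"expected {expected_version}, found {current}")
        self.items[key] = dict(item, version=expected_version + 1)


def update_with_retry(store, key, change, max_attempts=5):
    """Read, modify, conditionally write; start over if someone else won the race."""
    for _ in range(max_attempts):
        item = store.get(key)       # 1. read the item and its version
        updated = change(item)      # 2. do some processing
        try:
            store.put_if_version(key, updated, item["version"])  # 3. try to write
            return store.get(key)
        except VersionConflict:
            continue                # version moved: go back to step 1
    raise VersionConflict(f"gave up after {max_attempts} attempts")


store = ToyStore()
store.items["cart"] = {"version": 1, "count": 0}
update_with_retry(store, "cart", lambda i: dict(i, count=i["count"] + 1))
print(store.get("cart"))  # {'version': 2, 'count': 1}
```

Each retry costs another read and another write, which is the extra expense listed under "Why disable it?".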
    -UPDATE_SKIP_NULL_ATTRIBUTES
        * Will only update attributes you specified that are not null

    -Transactions
        * DynamoDB does not support them out of the box (as of this writing)
        * There are extension libraries through AWS that can do it (maybe only in Java?)
            ** Implemented by storing operations to a table and then committing all the stored operations

    -Time to Live
        * Can specify a timestamp attribute after which DynamoDB automatically deletes the item (free feature)
        * Moving expired items to S3 instead is not done by TTL itself; it needs extra wiring (e.g. reacting to the deletes via Streams)

    -Streams
        * Must be enabled
        * Provides an immutable, ordered stream of the updates made to a table
        * Best when you need to react to a change to a table, e.g. for replication/syncing/aggregation

    -Caching (DynamoDB Accelerator, DAX)
        * Must be enabled to use
        * tricky to maintain cache consistency
        * DAX has the same API as DynamoDB
        * Stores data that was written to DynamoDB
        * Provides sub-millisecond latency

Best practices
    * Distribute keys! Hot keys will mess you up real bad
    * Feed DynamoDB from an asynchronous queue. Requeue on throughput exceptions.
    * If you don't need to return data "live", read asynchronously as well.
    * Read all of the "Limits" document!
    * Avoid complex queries!
    * Caching can be very helpful and/or very painful
    * Remember there's a 10GB limit per partition key value (the item collection limit on tables with local secondary indexes)!
    * "The data in DynamoDB is not structured to populate a dashboard nor is it structured to work well for more complicated analytics. For these tasks, we chose two different technologies: Elasticsearch and Google BigQuery."
    * You may want to partition "upstream" of DynamoDB to have better control and reduce costs
    * Batch writes whenever possible!
    * Evenly spacing requests is best
    * Amazon suggests turning on DynamoDB Auto Scaling
    * AWS SDK has automatic retry functionality (don't need to write it yourself)
    * If you enable "eventually consistent reads" you will consume only half as much read throughput
    * Amz: "You should design your application for uniform activity across all logical partition keys in the Table and its secondary indexes"
    * Without transactions, you cannot write to multiple tables atomically. Careful of orphan data, especially if network fails!
    * Best practices for X:Y relationships...
        1:1 => normal key/value. Partition key can be the unique key.
        1:many => table or global secondary index (GSI). Partition key is the unique key, sort key is the field you query on. e.g. personkey, birthday
        many:many => use a table AND a GSI with partition and sort keys switched. e.g. (personkey, birthday) and (birthday, personkey)
    * Hierarchical data options:
        1. use composite sort key to define a hierarchy (albumId, albumId:trackId)
        2. store as JSON document (capped to 400KB item size)
    * For events (eventId, timestamp), use one table per time period
        -precreate tables (on a daily||weekly||monthly schedule depending on needs)
        -only need provisioning on current table
        -reduce or turn off throughput for old tables
    * For a product catalog (many requests for a small number of items)
        -cache popular items in your application! (select id, description from ProductCatalog where id='popularProduct')
    * For Messaging apps (many:many)
        -Biz case: you want to sort/filter on many different attributes (date, sender, recipient...)
        -Partition on (recipient, date)
        -Separate bulk data into separate tables (e.g. sender/date/msgId instead of sender/date/message)
        -Turn attributes you want to filter on into GSI "projected attributes" (e.g. recipient/date, with msgId/recipient/date as projected attributes)
    * For write-heavy apps (e.g. tracking voting)
        -separate the same concept into multiple tables (e.g. clinton_1, clinton_2)
        -aggregate data in separate table when necessary (e.g. clinton_all)

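The write-heavy pattern in the last item (clinton_1, clinton_2, ... rolled up into clinton_all) can be sketched as sharded counters. This is an illustration only: a `Counter` stands in for the tables, the shard count of 4 is arbitrary, and the function names are hypothetical.

```python
import random
from collections import Counter


def shard_name(candidate, shard_count):
    """Pick one of N shards at random, e.g. 'clinton_1' .. 'clinton_4'."""
    return f"{candidate}_{random.randint(1, shard_count)}"


def record_vote(counters, candidate, shard_count=4):
    """Spread writes for one hot candidate across several shards."""
    counters[shard_name(candidate, shard_count)] += 1


def aggregate(counters, candidate, shard_count=4):
    """Sum the shards into a single total (the 'clinton_all' idea)."""
    return sum(counters[f"{candidate}_{n}"] for n in range(1, shard_count + 1))


votes = Counter()
for _ in range(1000):
    record_vote(votes, "clinton")
print(aggregate(votes, "clinton"))  # 1000
```

Writes never contend on a single key, at the cost of reads having to touch every shard, which is why the notes suggest keeping a separate aggregate rather than summing on every request.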
Sources:
    Pros/cons/opinions
        https://read.acloud.guru/why-amazon-dynamodb-isnt-for-everyone-and-how-to-decide-when-it-s-for-you-aefc52ea9476
        https://syslog.ravelin.com/you-probably-shouldnt-use-dynamodb-89143c1287ca
        https://stackoverflow.com/questions/49055206/dynamodb-save-api-optimistic-locking-and-savebehavior
        https://medium.com/@davidmytton/aws-vs-google-cloud-flexibility-vs-operational-simplicity-dca4324b03d4
        https://syslog.ravelin.com/scaling-a-startup-using-dynamodb-4d97b0843350
        https://medium.com/@kevrone/a-combination-of-costs-and-lack-of-a-relational-model-and-our-choice-to-not-replicate-a-backup-due-224f676cdaea
        https://segment.com/blog/the-million-dollar-eng-problem/
        https://www.dailycred.com/article/dynamodb-shortcomings-and-work-arounds
        https://www.slideshare.net/AmazonWebServices/design-patterns-using-amazon-dynamodb (released by AWS, super useful best practices)

    Overview/Facts
        https://brewing.codes/2017/11/06/dynamodb-overview/ (best overview of Dynamo features)
        https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBMapper.OptimisticLocking.html
        https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/globaltables_reqs_bestpractices.html
        https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AutoScaling.html
        https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ProvisionedThroughput.html
        https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-partition-key-design.html
        https://aws.amazon.com/blogs/database/how-amazon-dynamodb-adaptive-capacity-accommodates-uneven-data-access-patterns-or-why-what-you-know-about-dynamodb-might-be-outdated/ (this link is actually critical to understanding the new DynamoDB changes)
        https://blog.codeship.com/partitioning-behavior-of-dynamodb/ (when and how partitions are created)