TSyxzB6C

· 6 years ago · Feb 05, 2020, 04:20 AM
1# AWS Well-Architected Framework
2<br>
3
4**July 2019**
5
6This document describes the AWS Well-Architected Framework, which enables
7you to review and improve your cloud-based architectures and better
8understand the business impact of your design decisions. We address general
9design principles as well as specific best practices and guidance in five
10conceptual areas that we define as the *pillars* of the Well-Architected
11Framework.<br>
12
13## ? Introduction
14
15The AWS Well-Architected Framework `helps you understand the pros and cons of
16decisions you make while building systems on AWS.`<br>
17
18By using the Framework you will **learn architectural best practices** for
19designing and operating reliable, secure, efficient, and cost-effective
20systems in the cloud.<br>
21
22It provides a way for you to consistently measure your architectures against
23best practices and identify areas for improvement.<br>
24
25The process for reviewing an architecture is a **constructive conversation
26about architectural decisions**, and is not an audit mechanism.<br>
27
28We believe that having well-architected systems `greatly increases the 
29likelihood of business success.`<br><br>
30
31**AWS Solutions Architects** have years of experience architecting solutions
32across a wide variety of business verticals and use cases.<br>
33
34We have `helped design and review thousands of customers' architectures on
35AWS.`<br>
36
37From this experience, we have identified best practices and core strategies
38for architecting systems in the cloud.<br><br>
39
40The **AWS Well-Architected Framework** `documents a set of foundational 
41questions that allow you to understand if a specific architecture aligns
42well with cloud best practices.`<br>
43
44The framework provides a **consistent approach to evaluating systems against
45the qualities you expect** from modern cloud-based systems, and the
46remediation that would be required to achieve those qualities.<br>
47
48As AWS continues to evolve, and we continue to learn more from working
49with our customers, we will continue to refine the definition of
50well-architected.<br><br>
51
52This **framework is intended for those in technology roles**, such as chief
53technology officers (CTOs), architects, developers, and operations team
54members.<br>
55
56It `describes AWS best practices and strategies to use when designing and
57operating a cloud workload`, and provides links to further implementation
58details and architectural patterns.<br>
59
60For more information, see the
61[AWS Well-Architected homepage](https://aws.amazon.com/architecture/well-architected/?ref=wellarchitected-wp).<br><br>
62
63AWS also provides a service for reviewing your workloads at no charge.<br>
64
65The 
66[AWS Well-Architected Tool](https://aws.amazon.com/well-architected-tool/?ref=wellarchitected-wp)
67(**AWS WA Tool**) is `a service in the cloud that provides a consistent process
68for you to review and measure your architecture using the AWS 
69Well-Architected Framework`.<br>
70
71The AWS WA Tool provides recommendations for **making your workloads more
72reliable, secure, efficient, and cost-effective**.<br><br>
73
74To help you apply best practices, we have created
75[AWS Well-Architected Labs](https://wellarchitectedlabs.com/?ref=wellarchitected-wp),
76which **provides you with a repository of code and documentation** to give
77you hands-on experience implementing best practices.<br>
78
79We also have teamed up with select AWS Partner Network (APN) Partners, 
80who are members of the
81[AWS Well-Architected Partner program](https://aws.amazon.com/architecture/well-architected/partners/?ref=wellarchitected-wp).<br>
82
83These **APN Partners have deep AWS knowledge**, and can `help you review and
84improve your workloads.`<br><br>
85
86### ? Definitions
87
88Every day experts at AWS assist customers in architecting systems to take
89advantage of best practices in the cloud.<br>
90We work with you on making architectural trade-offs as your designs
91evolve.<br>
92
93As you deploy these systems into live environments, we learn how well 
94these systems perform and the consequences of those trade-offs.<br><br>
95
96Based on what we have learned we have created the AWS Well-Architected
97Framework, which provides a **consistent set of best practices for
98customers and partners to evaluate architectures**, and provides a **set of
99questions you can use to evaluate** how well an architecture is aligned to
100AWS best practices.<br><br>
101
102The AWS Well-Architected Framework is based on **five pillars** — `operational
103excellence, security, reliability, performance efficiency, and cost
104optimization.`<br>
105
106#### ✽ Operational Excellence
107
108> The ability to run and monitor systems to deliver business value and to
109continually improve supporting processes and procedures.<br>
110
111#### ✽ Security
112
113> The ability to protect information, systems, and assets while delivering
114business value through risk assessments and mitigation strategies.<br>
115
116#### ✽ Reliability
117
118> The ability of a system to recover from infrastructure or service
119disruptions, dynamically acquire computing resources to meet demand, and
120mitigate disruptions such as misconfigurations or transient network
121issues.<br>
122
123#### ✽ Performance Efficiency
124
125> The ability to use computing resources efficiently to meet system
126requirements, and to maintain that efficiency as demand changes and
127technologies evolve.<br>
128
129#### ✽ Cost Optimization
130
131> The ability to run systems to deliver business value at the lowest
132price point.
133<br>
134
135In the AWS Well-Architected Framework, we use these terms
136
137* A **component** is the `code, configuration and AWS Resources that together
138deliver against a requirement.` A component is often the unit of technical
139ownership, and is decoupled from other components.
140
141* We use the term **workload** to identify `a set of components that together
142deliver business value.` The workload is usually the level of detail that
143business and technology leaders communicate about.
144
145* We think about **architecture** as `being how components work together in a
146workload.` How components communicate and interact is often the focus of
147architecture diagrams.
148
149* **Milestones** `mark key changes in your architecture as it evolves
150throughout the product lifecycle` (design, testing, go live, and in
151production).
152 
153* Within an organization the **technology portfolio** is the `collection of
154workloads that are required for the business to operate.`<br><br>
155
156When architecting workloads you **make trade-offs between pillars based 
157upon your business context**.<br>
158These `business decisions can drive your engineering priorities`.<br>
159
160You might optimize to reduce cost at the expense of reliability in
161development environments, or, for mission-critical solutions, you might
162optimize reliability with increased costs.<br>
163
164In ecommerce solutions, performance can affect revenue and customer
165propensity to buy.<br>
166
167**Security and operational excellence** `are generally not traded-off against
168the other pillars.`<br><br>
169
170### ? On Architecture
171
172In on-premises environments, **customers often have a central team for
173technology architecture** that acts as an overlay to other product or
174feature teams to ensure they are following best practice.<br>
175
176`Technology architecture teams are often composed of a set of roles` such
177as Technical Architect (infrastructure), Solutions Architect (software), 
178Data Architect, Networking Architect, and Security Architect.<br>
179
180Often these teams use 
181[TOGAF](http://pubs.opengroup.org/architecture/togaf9-doc/arch/?ref=wellarchitected-wp)
182or the
183[Zachman Framework](https://www.zachman.com/about-the-zachman-framework?ref=wellarchitected-wp)
184as part of an enterprise architecture capability.<br><br>
185
186**At AWS, we prefer to distribute capabilities into teams** rather than having
187a centralized team with that capability.<br>
188
189There are **risks when you choose to distribute decision making authority**, 
190for example, ensuring that teams are meeting internal standards.<br>
191
192We `mitigate these risks in two ways`:
1931. First, we have ***practices***<sup>1</sup> that focus on enabling each team
194to have that capability, and we put in place experts who ensure that teams
195raise the bar on the standards they need to meet.<br>
196
1972. Second, we implement ***mechanisms***<sup>2</sup> that carry out automated
198checks to ensure standards are being met.<br>
199
200This distributed approach is supported by the
201[Amazon leadership principles](), 
202and establishes a culture across all roles that *works back*<sup>3</sup>
203from the customer.<br>
204
205Customer-obsessed teams build products in response to a customer need.<br><br>
206
207For architecture this means that we **expect every team to have the
208capability to create architectures and to follow best practices**.<br>
209
210To help new teams gain these capabilities or existing teams to raise 
211their bar, we `enable access to a virtual community of principal engineers`
212who can review their designs and help them understand what AWS best
213practices are.<br>
214
215The principal engineering community works to make best practices visible
216and accessible.<br>
217
218One way they do this, for example, is through **lunchtime talks** that focus
219on applying best practices to real examples.<br>
220These talks are recorded and can be used as part of onboarding materials
221for new team members.<br><br>
222
223♣ AWS best practices emerge from our experience running thousands of
224systems at internet scale.<br>
225
226♣ We prefer to use data to define best practice, but we also use subject
227matter experts like principal engineers to set them.<br>
228
229♣ As principal engineers see new best practices emerge, they work as a 
230community to ensure that teams follow them.<br>
231
232♣ In time, these best practices are formalized into our internal review
233processes, as well as into mechanisms that enforce compliance.<br>
234
235♣ **Well-Architected is the customer-facing implementation of our internal
236review process**, where we have codified our principal engineering thinking
237across field roles like Solutions Architecture and internal engineering
238teams.<br>
239
240♣ **Well-Architected is a scalable mechanism** that lets you take advantage
241of these learnings.<br><br>
242
243♣ By following the approach of a principal engineering community with
244distributed ownership of architecture, we believe that a Well-Architected
245enterprise architecture can emerge that is driven by customer need.<br>
246
247♣ Technology leaders (such as CTOs or development managers), carrying out
248Well-Architected reviews across all your workloads will allow you to
249`better understand the risks in your technology portfolio.`<br>
250
251♣ Using this approach, you can **identify themes across teams that your
252organization could address** by mechanisms, trainings, or lunchtime talks
253where your principal engineers can share their thinking on specific areas
254with multiple teams.<br><br>
255
256<sup>1</sup> Ways of doing things, processes, standards, and accepted norms.<br>
257
258<sup>2</sup> *"Good intentions never work, you need good mechanisms to make
259anything happen"* — Jeff Bezos. This means replacing human best efforts
260with mechanisms (often automated) that check for compliance with rules or
261processes.<br>
262
263<sup>3</sup> Working backward is a fundamental part of our innovation 
264process. We start with the customer and what they want, & let that 
265define and guide our efforts.<br><br>
266
267### ⚒ General Design Principles
268
269The Well-Architected Framework identifies a set of general design
270principles to facilitate good design in the cloud:
271
272✱ **Stop guessing your capacity needs**:<br>
2731. Eliminate guessing about your infrastructure capacity needs. 
2742. When you make a capacity decision before you deploy a system, you 
275might end up sitting on expensive idle resources or dealing with the 
276performance implications of limited capacity. 
2773. With cloud computing, these problems can go away. 
2784. You can use as much or as little capacity as you need, & scale up 
279and down automatically.<br><br>
280
281✱ **Test systems at production scale**:<br>
2821. In the cloud, you can create a production-scale test environment on
283demand, complete your testing, and then decommission the resources.
2842. Because you only pay for the test environment when it's running, you 
285can simulate your live environment for a fraction of the cost of testing
286on premises.<br><br>
287
288✱ **Automate to make architectural experimentation easier**:<br>
2891. Automation allows you to create and replicate your systems at low cost
290and avoid the expense of manual effort.
2912. You can track changes to your automation, audit the impact, and revert
292to previous parameters when necessary.<br><br>
293
294✱ **Allow for evolutionary architectures**:<br>
2951. In a traditional environment, architectural decisions are often
296implemented as static, one-time events, with a few major versions of a
297system during its lifetime.
2982. As a business and its context continue to change, these initial
299decisions might hinder the system's ability to deliver changing business
300requirements.
3013. In the cloud, the capability to automate and test on demand lowers
302the risk of impact from design changes.
3034. This allows systems to evolve over time so that businesses can take
304advantage of innovations as a standard practice.<br><br>
305
306✱ **Drive architectures using data**:<br>
3071. In the cloud you can collect data on how your architectural choices
308affect the behavior of your workload.
3092. This lets you make fact-based decisions on how to improve your
310workload.
3113. Your cloud infrastructure is code, so you can use that data to inform
312your architecture choices and improvements over time.<br><br>
313
314✱ **Improve through game days**:<br>
3151. Test how your architecture and processes perform by regularly
316scheduling game days to simulate events in production.
3172. This will help you understand where improvements can be made and can
318help develop organizational experience in dealing with events.<br><br>
319
320## ? The Five Pillars of the Framework
321
322Creating a software system is a lot like constructing a building.<br>
323If the foundation is not solid, structural problems can undermine the
324integrity and function of the building.<br>
325
326When architecting technology solutions, **if you neglect the five pillars**
327of operational excellence, security, reliability, performance efficiency, 
328and cost optimization, `it can become challenging to build a system that
329delivers on your expectations and requirements`.<br>
330
331Incorporating these pillars into your architecture will **help you produce
332stable and efficient systems**.<br>
333
334This will allow you to focus on the other aspects of design, such as
335functional requirements.<br><br>
336
337### ✨ Operational Excellence
338
339The **Operational Excellence** pillar includes `the ability to run and
340monitor systems to deliver business value & to continually improve
341supporting processes and procedures.`<br>
342
343The operational excellence pillar provides an overview of design principles,
344best practices, and questions.<br>
345
346You can find prescriptive guidance on implementation in the
347[Operational Excellence Pillar whitepaper](https://d0.awsstatic.com/whitepapers/architecture/AWS-Operational-Excellence-Pillar.pdf?ref=wellarchitected-wp).<br><br>
348
349#### ? A. Design Principles
350
351There are `six design principles` for operational excellence in the cloud:<br>
352
353❆ **Perform operations as code**:<br>
3541. In the cloud, you can apply the same engineering discipline that you
355use for application code to your entire environment.
3562. You can <code>define your entire workload (applications, infrastructure) as
357code and update it with code.</code>
3583. You can <code>implement your operations procedures as code and automate
359their execution</code> by triggering them in response to events.
3604. By performing operations as code, you limit human error and enable
361consistent responses to events.<br><br>
362
363❆ **Annotate documentation**:<br>
3641. In an on-premises environment, documentation is created by hand, used
365by people, and hard to keep in sync with the pace of change.
3662. In the cloud, you can <code>automate the creation of annotated documentation
367after every build</code> (or automatically annotate hand-crafted documentation).
3683. Annotated documentation can be used by people and systems.
3694. Use annotations as an input to your operations code.<br><br>
370
371❆ **Make frequent, small, reversible changes**:<br>
3721. Design workloads to `allow components to be updated regularly.`
3732. `Make changes in small increments that can be reversed` if they fail
374(without affecting customers when possible).<br><br>
375
376❆ **Refine operations procedures frequently**:<br>
3771. As you use operations procedures, <code>look for opportunities to improve
378them.</code>
3792. As you evolve your workload, <code>evolve your procedures appropriately.</code>
3803. <code>Set up regular game days</code> to review and validate that all procedures
381are effective and that teams are familiar with them.<br><br>
382
383❆ **Anticipate failure**:<br>
3841. <code>Perform "pre-mortem" exercises</code> to identify potential sources of 
385failure so that they can be removed or mitigated.
3862. <code>Test your failure scenarios</code> and validate your understanding of their
387impact.
3883. <code>Test your response procedures</code> to ensure that they are effective, and
389that teams are familiar with their execution.
3904. <code>Set up regular game days</code> to test workloads and team responses to
391simulated events.<br><br>
392
393❆ **Learn from all operational failures**:<br>
3941. <code>Drive improvement through lessons learned</code> from all operational events
395and failures.
3962. <code>Share what is learned</code> across teams and through the entire organization.<br><br>
397
398#### ? B. Definition
399
400There are `three best practice areas for operational excellence in the
401cloud`:<br>
402
403⌖ **Prepare**<br>
404
405⌖ **Operate**<br>
406
407⌖ **Evolve**<br><br>
408
409☛ Operations teams `need to understand their business and customer needs`
410so they can effectively and efficiently support business outcomes.<br>
411
412☛ Operations `creates and uses procedures to respond to operational events`
413and validates their effectiveness to support business needs.<br>
414
415☛ Operations `collects metrics` that are used to measure the achievement
416of desired business outcomes.<br>
417
418☛ `Everything continues to change` — your business context, business
419priorities, customer needs, etc.<br>
420
421☛ It's important to `design operations to support evolution over time` in
422response to change and to incorporate lessons learned through their
423performance.<br><br>
424
425#### ? C. Best Practices
426
427#### ❶ *Prepare*
428
429Effective preparation is required to drive operational excellence.<br>
430
431Business success is enabled by shared goals and understanding across the
432business, development, and operations.<br>
433
434**Common standards** simplify workload design and management, enabling 
435operational success.<br>
436
437? Design workloads with mechanisms to monitor and gain insight into
438application, platform, and infrastructure components, as well as customer
439experience and behavior.<br><br>
440
441? Create mechanisms to validate that workloads, or changes, are ready to be
442moved into production and supported by operations.<br>
443
444**Operational readiness is validated through checklists** to ensure a 
445workload meets defined standards & that required procedures are
446adequately captured in runbooks and playbooks.<br>
447
448? Validate that there are sufficient trained personnel to effectively
449support the workload.<br>
450
451? Prior to transition, test responses to operational events and failures.<br>
452
453? Practice responses in supported environments through failure injection 
454and game day events.<br><br>
455
456**AWS enables operations as code in the cloud** and the ability to safely
457experiment, develop operations procedures, and practice failure.<br>
458
459Using **AWS CloudFormation** `enables you to have consistent, templated, 
460sandbox development, test, and production environments with increasing
461levels of operations control.`<br>
462
463AWS enables visibility into your workloads at all layers through various
464log collection and monitoring features.<br>
465
466`Data on use of resources, application programming interfaces (APIs), and
467network flow logs can be collected` using **Amazon CloudWatch, AWS 
468CloudTrail, and VPC Flow Logs**.<br>
469
470You can use the **collectd plugin**, or the **CloudWatch Logs agent**, to 
471`aggregate information about the operating system into CloudWatch.`<br><br>
472
473The following questions focus on these considerations for operational
474excellence.<br>
475(For a list of operational excellence questions, answers, and best 
476practices, see the **Appendix**.)<br><br>
477
478? **OPS 1**: **How do you determine what your priorities are?**<br>
479
480> Everyone needs to understand their part in enabling business success.
481Have shared goals in order to set priorities for resources. This will
482maximize the benefits of your efforts. [(hello)]()<br>
483<br>
484
485? **OPS 2**: **How do you design your workload so that you can 
486understand its state?**<br>
487
488> Design your workload so that it provides the information necessary for
489you to understand its internal state (for example, metrics, logs, and
490traces) across all components. This enables you to provide effective
491responses when appropriate. [(hello)]()<br>
492<br>
493
494? **OPS 3**: **How do you reduce defects, ease remediation, and improve
495flow into production?**<br>
496
497> Adopt approaches that improve flow of changes into production, that
498enable refactoring, fast feedback on quality, and bug fixing. These
499accelerate beneficial changes entering production, limit issues deployed, 
500& enable rapid identification and remediation of issues introduced
501through deployment activities. [(hello)]()<br>
502<br>
503
504? **OPS 4**: **How do you mitigate deployment risks?**<br>
505
506> Adopt approaches that provide fast feedback on quality and enable
507rapid recovery from changes that do not have desired outcomes. Using
508these practices mitigates the impact of issues introduced through the
509deployment of changes. [(hello)]()<br>
510<br>
511
512? **OPS 5**: **How do you know that you are ready to support a workload?**<br>
513
514> Evaluate the operational readiness of your workload, processes and
515procedures, & personnel to understand the operational risks related to
516your workload. [(hello)]()<br>
517<br>
518
519? Implement the minimum number of architecture standards for your
520workloads.<br>
521
522? Balance the cost to implement a standard against the benefit to the
523workload and the burden upon operations.<br>
524
525? Reduce the number of supported standards to reduce the chance that
526lower-than-acceptable standards will be applied by error.<br>
527
528Operations personnel are often constrained resources.<br>
529
530? Invest in implementing operations activities as code to maximize the
531productivity of operations personnel, minimize error rates, and enable
532automated responses.<br>
533
534? Adopt deployment practices that take advantage of the elasticity of
535the cloud to facilitate pre-deployment of systems for faster
536implementations.<br><br>
537
538#### ❷ *Operate*
539
540`Successful operation of a workload is measured by the achievement of
541business and customer outcomes.`<br>
542
543? **Define expected outcomes**, determine how success will be measured, &
544identify the workload and operations metrics that will be used in those
545calculations to determine if operations are successful.<br>
546
547Consider that operational health includes both the health of the workload
548& the health and success of the operations acting upon the workload (for
549example, deployment and incident response).<br>
550
551? **Establish baselines** from which improvement or degradation of operations
552will be identified, collect and analyze your metrics, & then validate
553your understanding of operations success and how it changes over time.<br>
554
555Use collected metrics to determine if you are satisfying customer and 
556business needs, & identify areas for improvement.<br><br>
557
558`Efficient and effective management of operational events is required to
559achieve operational excellence.`<br>
560This applies to both planned and unplanned operational events.<br>
561
562? **Use established runbooks** for well-understood events, **and use playbooks** to
563aid in the resolution of other events.<br>
564
565? **Prioritize responses to events** based on their business and customer impact.<br>
566
567Ensure that if an alert is raised in response to an event, there is an
568associated process to be executed, with a specifically identified owner.<br>
569
570? **Define in advance the personnel required** to resolve an event and include
571escalation triggers to engage additional personnel, as it becomes
572necessary, based on impact (that is, duration, scale, and scope).<br>
573
574? **Identify and engage individuals with the authority to decide** on courses
575of action where there will be a business impact from an event response
576not previously addressed.<br><br>
577
578? **Communicate the operational status of workloads** through dashboards and
579notifications that are tailored to the target audience (for example, 
580customer, business, developers, operations) so that 
5811. They may take appropriate action,
5822. Their expectations are managed, and
5833. They are informed when normal operations resume.<br><br>
584
585? **Determine the root cause** of unplanned events and unexpected impacts from
586planned events.<br>
587
588This information will be used to update your procedures to mitigate
589future occurrence of events.<br>
590
591Communicate root cause with affected communities as appropriate.<br><br>
592
593In AWS, you can `generate dashboard views of your metrics collected from
594workloads and natively from AWS.`<br>
595
596You can leverage **CloudWatch or third-party applications** to `aggregate and
597present business, workload, and operations level views of operations
598activities.`<br>
599
600AWS provides `workload insights through logging capabilities` including
601**AWS X-Ray, CloudWatch, CloudTrail, and VPC Flow Logs** enabling the
602`identification of workload issues in support of root cause analysis and
603remediation.`<br><br>
604
605The following questions focus on these considerations for operational
606excellence.<br><br>
607
608? **OPS 6**: **How do you understand the health of your workload?**<br>
609
610> Define, capture, and analyze workload metrics to gain visibility to
611workload events so that you can take appropriate action. [(hello)]()<br>
612<br>
613
614? **OPS 7**: **How do you understand the health of your operations?**<br>
615
616> Define, capture, and analyze operations metrics to gain visibility to
617operations events so that you can take appropriate action. [(hello)]()<br>
618<br>
619
620? **OPS 8**: **How do you manage workload and operations events?**<br>
621
622> Prepare and validate procedures for responding to events to minimize
623their disruption to your workload. [(hello)]()<br>
624<br>
625
626Routine operations, as well as responses to unplanned events, should be
627automated.<br>
628
629Manual processes for deployments, release management, changes, and 
630rollbacks should be avoided.<br>
631
632? **Releases should not be large batches** that are done infrequently.<br>
633Rollbacks are more difficult in large changes.<br>
634
635Failing to have a rollback plan, or the ability to mitigate failure
636impacts, will prevent continuity of operations.<br>
637
638? **Align metrics to business needs** so that responses are effective at
639maintaining business continuity.<br>
640
641One-time decentralized metrics with manual responses will result in
642greater disruption to operations during unplanned events.<br><br>
643
644#### ❸ *Evolve*
645
646`Evolution of operations is required to sustain operational excellence.`<br>
647
648Dedicate work cycles to making continuous incremental improvements.<br>
649
650? **Regularly evaluate and prioritize opportunities for improvement** (for
651example, feature requests, issue remediation, and compliance requirements), 
652including both the workload and operations procedures.<br>
653
654? **Include feedback loops within your procedures** to rapidly identify areas
655for improvement and capture learnings from the execution of operations.<br><br>
656
657? **Share lessons learned** across teams to share the benefits of those lessons.<br>
658
659? **Analyze trends within lessons learned** and perform cross-team 
660retrospective analysis of operations metrics to identify opportunities
661and methods for improvement.<br>
662
663? **Implement changes** intended to bring about improvement and evaluate the
664results to determine success.<br><br>
665
666With **AWS Developer Tools**, you can `implement continuous delivery build, 
667test, and deployment activities that work with a variety of source code, 
668build, testing, and deployment tools from AWS and third parties.`<br>
669
670The results of deployment activities can be used to identify
671opportunities for improvement for both deployment and development.<br>
672
673? You can **perform analytics on your metrics data** integrating data from your
674operations and deployment activities, to enable analysis of the impact of
675those activities against business and customer outcomes.<br>
676
677This data can be leveraged in **cross-team retrospective analysis** to
678identify opportunities and methods for improvement.<br><br>
679
680The following questions focus on these considerations for operational
681excellence.<br><br>
682
683? **OPS 9**: **How do you evolve operations?**<br>
684
685> Dedicate time and resources for continuous incremental improvement to
686evolve the effectiveness and efficiency of your operations. [(hello)]()<br>
687<br>
688
689`Successful evolution of operations is founded in`: 
6901. Frequent small improvements
6912. Providing safe environments and time to experiment, develop
692, and test improvements, &
6933. Environments in which learning from failures is encouraged.<br>
694
695Operations support for sandbox, development, test, and production 
696environments, with increasing level of operational controls, facilitates
697development and increases the predictability of successful results from
698changes deployed into production.<br><br>
699
700#### ? D. Key AWS Services
701
702The AWS service that is essential to Operational Excellence is **AWS
703CloudFormation**, which you can `use to create templates based on best
704practices.`<br>
705
706This enables you to `provision resources in an orderly and consistent
707fashion from your development through production environments.`<br>
708
709The following services and features support the three areas in 
710operational excellence:
711
712* **Prepare**:<br> 
713**AWS Config and AWS Config rules** can be used to create standards for 
714workloads and to determine if environments are compliant with those 
715standards before being put into production.<br>
716
717* **Operate**:<br>
718**Amazon CloudWatch** allows you to monitor the operational health of a 
719workload.<br>
720
721* **Evolve**:<br>
722**Amazon Elasticsearch Service (Amazon ES)** allows you to analyze your log 
723data to gain actionable insights quickly and securely.<br><br>
724
725#### ? E. Resources
726
727Refer to the following resources to learn more about our best practices
728for Operational Excellence.<br>
729
730**Documentation**<br>
731
732❁ [DevOps and AWS](https://aws.amazon.com/devops/?ref=wellarchitected-wp)<br><br>
733
734**Whitepaper**<br>
735
736❁ [Operational Excellence Pillar](https://d0.awsstatic.com/whitepapers/architecture/AWS-Operational-Excellence-Pillar.pdf?ref=wellarchitected-wp)<br><br>
737
738**Video**<br>
739
740❁ [DevOps at Amazon](https://www.youtube.com/watch?v=esEFaY0FDKc&ref=wellarchitected-wp)<br><br>
741
742### ?️‍♀️ Security
743
744The **Security** pillar includes `the ability to protect information, 
745systems, and assets while delivering business value through risk
746assessments and mitigation strategies.`<br>
747
748The security pillar provides an overview of design principles, best 
749practices, and questions.<br>
750
751You can find prescriptive guidance on implementation in the
752[Security Pillar whitepaper](https://d0.awsstatic.com/whitepapers/architecture/AWS-Security-Pillar.pdf?ref=wellarchitected-wp).<br><br>
753
754#### ? A. Design Principles
755
756There are `seven design principles` for security in the cloud:
757
758❆ **Implement a strong identity foundation**:<br>
7591. Implement the `principle of least privilege` and <code>enforce separation of
760duties</code> with appropriate authorization for each interaction with your
761AWS resources.
7622. <code>Centralize privilege management</code> and reduce or even eliminate reliance
763on long-term credentials.<br><br>
764
765❆ **Enable traceability**:<br>
7661. `Monitor, alert, & audit actions and changes` to your environment in
767real time.
7682. `Integrate logs and metrics with systems` to automatically respond and
769take action.<br><br>
770
771❆ **Apply security at all layers**:<br>
7721. Rather than just focusing on protection of a single outer layer, apply
773a `defense-in-depth approach` with other security controls.
7742. `Apply to all layers` (e.g., edge network, VPC, subnet, load balancer, 
775every instance, operating system, and application).<br><br>
776
777❆ **Automate security best practices**:<br>
7781. `Automated software-based security mechanisms` improve your ability to
779securely scale more rapidly and cost effectively.
7802. `Create secure architectures`, including the implementation of controls
781that are defined and managed as code in version-controlled templates.<br><br>
782
783❆ **Protect data in transit and at rest**:<br>
7841. `Classify your data into sensitivity levels and use mechanisms`, such as
785encryption, tokenization, and access control where appropriate.<br><br>
786
787❆ **Keep people away from data**:<br>
7881. `Create mechanisms and tools` to reduce or eliminate the need for direct
789access or manual processing of data.
7902. This reduces the risk of loss or modification and human error when
791handling sensitive data.<br><br>
792
793❆ **Prepare for security events**:<br>
7941. Prepare for an incident by having an `incident management process` that
795aligns to your organizational requirements.
7962. `Run incident response simulations and use tools with automation` to
797increase your speed for detection, investigation, and recovery.<br><br>
798
799#### ? B. Definition
800
801There are `five best practice areas for security in the cloud`:<br>
802
803⌖ **Identity and Access Management**<br>
804
805⌖ **Detective Controls**<br>
806
807⌖ **Infrastructure Protection**<br>
808
809⌖ **Data Protection**<br>
810
811⌖ **Incident Response**<br><br>
812
813☛ Before you architect any system, you need to `put in place practices 
814that influence security.`<br>
815You will want to control who can do what.<br>
816
817☛ In addition, `you want to be able to`
8181. Identify security incidents, 
8192. Protect your systems and services, and 
8203. Maintain the confidentiality and integrity of data through data 
821protection.<br>
822
823☛ You should have a `well-defined and practiced process` for responding 
824to security incidents.<br>
825
826These tools and techniques are important because they support objectives
827such as preventing financial loss or complying with regulatory 
828obligations.<br><br>
829
830The **AWS Shared Responsibility Model** enables organizations that adopt 
831the cloud to achieve their security and compliance goals.<br>
832
833Because AWS physically secures the infrastructure that supports our cloud
834services, as an AWS customer you can focus on using services to 
835accomplish your goals.<br>
836
837☛ The AWS Cloud also provides `greater access to security data` and an
838`automated approach to responding to security events.`<br><br>
839
840#### ? C. Best Practices
841
842#### ❶ *Identity and Access Management*
843
844? Identity and access management `are key parts of an information security
845program`, ensuring that only authorized and authenticated users are able
846to access your resources, and only in a manner that you intend.<br>
847
848? For example, you should **define principals** (that is, users, groups,
849services, and roles that take action in your account), **build out policies**
850aligned with these principals, and **implement strong credential
851management**.<br>
852
853`These privilege-management elements form the core of authentication and
854authorization.`<br><br>
855
856In AWS, privilege management is primarily supported by the **AWS Identity
857and Access Management (IAM)** service, which allows you to `control user and
858programmatic access to AWS services and resources.`<br>
859
860You should apply **granular policies**, which assign permissions to a user, 
861group, role, or resource.<br>
862
863You also have the ability to require **strong password practices**, such as
864complexity level, avoiding re-use, and enforcing multi-factor
865authentication (MFA).<br>
866
867You can **use federation with your existing directory service**.<br>
868
869For workloads that require systems to have access to AWS, `IAM enables
870secure access through` roles, instance profiles, identity federation, and
871temporary credentials.<br><br>
872
873The following questions focus on these considerations for security.<br>
874(For a list of security questions, answers, and best practices, see the
875**Appendix**.)<br><br>
876
877? **SEC 1**: **How do you manage credentials and authentication?**<br>
878
879> Credentials and authentication mechanisms include passwords, tokens, 
880and keys that grant access directly or indirectly in your workload. 
881Protect credentials with appropriate mechanisms to help reduce the risk
882of accidental or malicious use. [(hello)]()<br>
883<br>
884
885? **SEC 2**: **How do you control human access?**<br>
886
887> Control human access by implementing controls inline with defined
888business requirements to reduce risk and lower the impact of unauthorized
889access. This applies to privileged users and administrators of your AWS
890account, and also applies to end users of your application. [(hello)]()<br>
891<br>
892
893? **SEC 3**: **How do you control programmatic access?**<br>
894
895> Control programmatic or automated access with appropriately defined, 
896limited, and segregated access to help reduce the risk of unauthorized
897access. Programmatic access includes access that is internal to your
898workload, and access to AWS related resources. [(hello)]()<br>
899<br>
900
901Credentials must not be shared between any user or system.<br>
902
903? **User access** should be granted using a `least-privilege approach` with best
904practices including password requirements and MFA enforced.<br>
905
906? **Programmatic access** including API calls to AWS services should be
907performed using `temporary and limited-privilege credentials` such as 
908those issued by the **AWS Security Token Service**.<br><br>
909
910AWS provides resources that can help you with identity and access
911management.<br>
912
913To help learn best practices, explore our hands-on labs on
914[managing credentials & authentication](https://wellarchitectedlabs.com/Security/Quest_Managing_Credentials_and_Authentication/README.html?ref=wellarchitected-wp),
915[controlling human access](https://wellarchitectedlabs.com/Security/Quest_Control_Human_Access/README.html?ref=wellarchitected-wp), and
916[controlling programmatic access](https://wellarchitectedlabs.com/Security/Quest_Control_Programmatic_Access/README.html?ref=wellarchitected-wp).<br><br>
917
918#### ❷ *Detective Controls*
919
920You can use detective controls to `identify a potential security threat
921or incident.`<br>
922
923? They are an **essential part of governance frameworks** and can be used to
924support a quality process, a legal or compliance obligation, & for
925threat identification and response efforts.<br>
926
927There are different types of detective controls.<br>
928
929**For example**, conducting an inventory of assets and their detailed 
930attributes promotes more effective decision making (and lifecycle 
931controls) to help establish operational baselines.<br>
932
933? You can also use **internal auditing**, an examination of controls related to
934information systems, to ensure that practices meet policies and 
935requirements & that you have set the correct automated alerting
936notifications based on defined conditions.<br>
937
938? These controls are **important reactive factors** that can help your
939organization identify and understand the scope of anomalous activity.<br><br>
940
941In AWS, you can implement detective controls by `processing logs, events, 
942and monitoring that allows for auditing, automated analysis, and alarming.`<br>
943
944**CloudTrail logs, AWS API calls, and CloudWatch** provide monitoring of
945metrics with alarming, & **AWS Config** provides configuration history.<br>
946
947**Amazon GuardDuty** is `a managed threat detection service` that continuously
948monitors for malicious or unauthorized behavior to `help you protect your
949AWS accounts and workloads.`<br>
950
951Service-level logs are also available, for example, you can use Amazon
952Simple Storage Service (**Amazon S3**) `to log access requests.`<br><br>
953
954The following questions focus on these considerations for security.<br><br>
955
956? **SEC 4**: **How do you detect and investigate security events?**<br>
957
958> Capture and analyze events from logs and metrics to gain visibility.
959Take action on security events and potential threats to help secure your
960workload. [(hello)]()<br>
961<br>
962
963? **SEC 5**: **How do you defend against emerging security threats?**<br>
964
965> Staying up to date with AWS and industry best practices & threat
966intelligence helps you be aware of new risks. This enables you to create
967a threat model to identify, prioritize, and implement appropriate 
968controls to help protect your workload. [(hello)]()<br>
969<br>
970
971? **Log management is important to a well-architected design** for reasons
972ranging from security or forensics to regulatory or legal requirements.<br>
973
974It is critical that you analyze logs and respond to them so that you can
975`identify potential security incidents.`<br>
976
977? AWS provides functionality that makes **log management easier to implement**
978by giving you the ability to define a data-retention lifecycle or define
979where data will be preserved, archived, or eventually deleted.<br>
980
981This makes predictable and reliable `data handling simpler and more cost
982effective.`<br><br>
983
984#### ❸ *Infrastructure Protection*
985
986Infrastructure protection `encompasses control methodologies, such as
987defense in depth, necessary to meet best practices and organizational or
988regulatory obligations.`<br>
989
990Use of these methodologies is **critical for successful, ongoing 
991operations** in either the cloud or on-premises.<br><br>
992
993? In AWS, you can implement **stateful and stateless packet inspection**, 
994either by using AWS-native technologies or by using partner products and
995services available through the AWS Marketplace.<br>
996
997? You should use **Amazon Virtual Private Cloud (Amazon VPC)** to `create a 
998private, secured, and scalable environment in which you can define your
999topology` — including gateways, routing tables, & public and private
1000subnets.<br><br>
1001
1002The following questions focus on these considerations for security.<br><br>
1003
1004? **SEC 6**: **How do you protect your networks?**<br>
1005
1006> Public and private networks require multiple layers of defense to help
1007protect from external and internal network-based threats. [(hello)]()<br>
1008<br>
1009
1010? **SEC 7**: **How do you protect your compute resources?**<br>
1011
1012> Compute resources in your workload require multiple layers of defense
1013to help protect from external and internal threats. Compute resources
1014include EC2 instances, containers, AWS Lambda functions, database
1015services, IoT devices, and more. [(hello)]()<br>
1016<br>
1017
1018? **Multiple layers of defense** are advisable in any type of environment.<br>
1019
1020In the case of infrastructure protection, many of the concepts and
1021methods are valid across cloud and on-premises models.<br>
1022
1023? `Factors essential to an effective information security plan`:<br>
10241. Enforcing boundary protection, 
10252. Monitoring points of ingress and egress, and
10263. Comprehensive logging, monitoring, and alerting<br><br>
1027
1028? **AWS customers are able to tailor, or harden**, the configuration of an
1029Amazon Elastic Compute Cloud (Amazon EC2), Amazon EC2 Container Service 
1030(Amazon ECS) container, or AWS Elastic Beanstalk instance, and **persist
1031this configuration to an immutable Amazon Machine Image (AMI)**.<br>
1032
1033Then, whether triggered by Auto Scaling or launched manually, all new
1034virtual servers (instances) launched with this AMI receive the hardened
1035configuration.<br><br>
1036
1037#### ❹ *Data Protection*
1038
1039`Before architecting any system, foundational practices that influence
1040security should be in place.`<br>
1041
1042? For example, **data classification** provides a way to categorize 
1043organizational data based on levels of sensitivity, and **encryption**
1044protects data by way of rendering it unintelligible to unauthorized
1045access.<br>
1046
1047These tools and techniques are important because they support objectives
1048such as preventing financial loss or complying with regulatory
1049obligations.<br><br>
1050
1051**In AWS, the following practices** `facilitate protection of data`:<br>
1052
1053❆ As an AWS customer you maintain **full control over your data**.<br><br>
1054
1055❆ AWS makes it **easier for you to encrypt your data and manage keys**, 
1056including regular key rotation, which can be easily automated by AWS
1057or maintained by you.<br><br>
1058
1059❆ **Detailed logging** that contains important content, such as file access
1060and changes, is available.<br><br>
1061
1062❆ AWS has **designed storage systems for exceptional resiliency**. For 
1063example, Amazon S3 Standard, S3 Standard-IA, S3 One Zone-IA, and Amazon
1064Glacier are all designed to provide **99.999999999% durability of objects
1065over a given year**. This durability level corresponds to an average 
1066annual expected loss of 0.000000001% of objects.<br><br>
1067
1068❆ **Versioning**, which can be part of a larger data lifecycle management
1069process, can `protect against accidental overwrites, deletes, and
1070similar harm`.<br><br>
1071
1072❆ **AWS never initiates the movement of data between Regions**. Content
1073placed in a Region will remain in that Region unless you explicitly
1074enable a feature or leverage a service that provides that functionality.<br><br>
1075
1076The following questions focus on these considerations for security.<br><br>
1077
1078? **SEC 8**: **How do you classify your data?**<br>
1079
1080> Classification provides a way to categorize data, based on levels of
1081sensitivity, to help you determine appropriate protective and retention
1082controls. [(hello)]()<br>
1083<br>
1084
1085? **SEC 9**: **How do you protect your data at rest?**<br>
1086
1087> Protect your data at rest by defining your requirements and
1088implementing controls, including encryption, to reduce the risk of
1089unauthorized access or loss. [(hello)]()<br>
1090<br>
1091
1092? **SEC 10**: **How do you protect your data in transit?**<br>
1093
1094> Protecting your data in transit by defining your requirements and
1095implementing controls, including encryption, reduces the risk of
1096unauthorized access or exposure. [(hello)]()<br>
1097<br>
1098
1099AWS provides multiple means for encrypting data at rest and in transit.<br>
1100
1101We build features into our services that make it easier to encrypt your
1102data.<br>
1103
1104? For example, we have implemented **server-side encryption (SSE) for Amazon
1105S3** to make it `easier for you to store your data in an encrypted form.`<br>
1106
1107? You can also arrange for the entire HTTPS encryption and decryption
1108process (generally known as **SSL termination**) to be handled by **Elastic
1109Load Balancing (ELB)**.<br><br>
1110
1111#### ❺ *Incident Response*
1112
1113? Even with extremely mature preventive and detective controls, `your
1114organization should still put processes in place to` **respond to and
1115mitigate the potential impact of security incidents**.<br>
1116
1117? The `architecture of your workload strongly affects the ability of your
1118teams`
11191. To operate effectively during an incident, 
11202. To isolate or contain systems, and 
11213. To restore operations to a known good state.<br>
1122
1123? Putting in place the tools and access ahead of a security incident, then
1124routinely practicing incident response through game days, will help you
1125`ensure that your architecture can accommodate timely investigation and
1126recovery`.<br><br>
1127
1128**In AWS, the following practices** `facilitate effective incident response`:<br>
1129
1130❆ **Detailed logging** is available that contains important content, such as
1131file access and changes.<br>
1132
1133❆ **Events can be automatically processed and trigger tools** that automate
1134responses through the use of AWS APIs.<br>
1135
1136❆ You can pre-provision tooling and a **"clean room" using AWS 
1137CloudFormation**. This allows you to `carry out forensics in a safe, 
1138isolated environment.`<br><br>
1139
1140The following questions focus on these considerations for security.<br><br>
1141
1142? **SEC 11**: **How do you respond to an incident?**<br>
1143
1144> Preparation is critical to timely investigation and response to 
1145security incidents to help minimize potential disruption to your
1146organization. [(hello)]()<br>
1147<br>
1148
1149? Ensure that you have a way to **quickly grant access for your InfoSec 
1150team**, and automate the isolation of instances as well as the capturing 
1151of data and state for forensics.<br><br>
1152
1153#### ? D. Key AWS Services
1154
1155The AWS service that is essential to Security is **AWS Identity and Access
1156Management (IAM)**, which allows you to `securely control access to AWS
1157services and resources for your users.`<br>
1158
1159The following services and features support the five areas in security:<br>
1160
1161* **Identity and Access Management**:<br>
11621. **IAM** enables you to securely control access to AWS services and
1163resources.
11642. **MFA** adds an additional layer of protection on user access.
11653. **AWS Organizations** lets you centrally manage and enforce policies for
1166multiple AWS accounts.<br><br>
1167
1168* **Detective Controls**:<br>
11691. **AWS CloudTrail** records AWS API calls, **AWS Config** provides a detailed
1170inventory of your AWS resources and configuration.
11712. **Amazon GuardDuty** is a managed threat detection service that
1172continuously monitors for malicious or unauthorized behavior.
11733. **Amazon CloudWatch** is a monitoring service for AWS resources which can
1174trigger CloudWatch Events to automate security responses.<br><br>
1175
1176* **Infrastructure Protection**:<br>
11771. **Amazon Virtual Private Cloud (Amazon VPC)** enables you to launch AWS
1178resources into a virtual network that you've defined.
11792. **Amazon CloudFront** is a global content delivery network that securely
1180delivers data, videos, applications, and APIs to your viewers which
1181integrates with **AWS Shield** for DDoS mitigation.
11823. **AWS WAF** is a web application firewall that is deployed on either
1183Amazon CloudFront or Application Load Balancer to help protect your web
1184applications from common web exploits.<br><br>
1185
1186* **Data Protection**:<br>
11871. Services such as ELB, Amazon Elastic Block Store (Amazon EBS), Amazon
1188S3, and Amazon Relational Database Service (Amazon RDS) include
1189encryption capabilities to protect your data in transit and at rest.
11902. **Amazon Macie** automatically discovers, classifies and protects 
1191sensitive data, while **AWS Key Management Service (Amazon KMS)** makes it
1192easy for you to create and control keys used for encryption.<br><br>
1193
1194* **Incident Response**:<br>
11951. **IAM** should be used to grant appropriate authorization to incident
1196response teams and response tools.
11972. **AWS CloudFormation** can be used to create a trusted environment or
1198clean room for conducting investigations.
11993. **Amazon CloudWatch Events** allows you to create rules that trigger
1200automated responses including AWS Lambda.<br><br>
1201
1202#### ? E. Resources
1203
1204Refer to the following resources to learn more about our best practices
1205for Security.<br>
1206
1207**Documentation**<br>
1208
1209❁ [AWS Cloud Security](http://aws.amazon.com/security/?ref=wellarchitected-wp)<br>
1210❁ [AWS Compliance](https://aws.amazon.com/compliance/?ref=wellarchitected-wp)<br>
1211❁ [AWS Security Blog](http://blogs.aws.amazon.com/security/?ref=wellarchitected-wp)<br><br>
1212
1213**Whitepaper**<br>
1214
1215❁ [Security Pillar](https://d0.awsstatic.com/whitepapers/architecture/AWS-Security-Pillar.pdf?ref=wellarchitected-wp)<br>
1216❁ [AWS Security Overview](https://d0.awsstatic.com/whitepapers/Security/AWS%20Security%20Whitepaper.pdf?ref=wellarchitected-wp)<br>
1217❁ [AWS Security Best Practices](https://aws.amazon.com/whitepapers/aws-security-best-practices/?ref=wellarchitected-wp)<br>
1218❁ [AWS Risk and Compliance](https://d0.awsstatic.com/whitepapers/compliance/AWS_Risk_and_Compliance_Whitepaper.pdf?ref=wellarchitected-wp)<br><br>
1219
1220**Video**<br>
1221
1222❁ [AWS Security State of the Union](https://youtu.be/Wvyc-VEUOns?ref=wellarchitected-wp)<br>
1223❁ [Shared Responsibility Overview](https://www.youtube.com/watch?v=U632-ND7dKQ&ref=wellarchitected-wp)<br><br>
1224
1225### ? Reliability
1226
1227The **Reliability** pillar includes `the ability of a system to` 
12281. Recover from infrastructure or service disruptions, 
12292. Dynamically acquire computing resources to meet demand, and 
12303. Mitigate disruptions such as misconfigurations or transient network issues.<br>
1231
1232The reliability pillar provides an overview of design principles, best
1233practices, and questions.<br>
1234
1235You can find prescriptive guidance on implementation in the
1236[Reliability Pillar whitepaper](https://d0.awsstatic.com/whitepapers/architecture/AWS-Reliability-Pillar.pdf?ref=wellarchitected-wp).<br><br>
1237
1238#### ? A. Design Principles
1239
1240There are `five design principles` for reliability in the cloud:<br>
1241
1242❆ **Test recovery procedures**:<br>
12431. In an on-premises environment, testing is often conducted to prove 
1244the system works in a particular scenario.
12452. Testing is not typically used to validate recovery strategies.
12463. In the cloud, you can `test how your system fails`, and you can 
1247validate your recovery procedures.
12484. You can `use automation to simulate different failures` or to recreate
1249scenarios that led to failures before.
12505. This `exposes failure pathways that you can test and rectify` before a
1251real failure scenario, reducing the risk of components failing that have
1252not been tested before.<br><br>
1253
1254❆ **Automatically recover from failure**:<br>
12551. By monitoring a system for `key performance indicators (KPIs)`, you can
1256trigger automation when a threshold is breached.
12572. This allows for `automatic notification and tracking` of failures, &
1258for `automated recovery processes` that work around or repair the failure.
12593. With more sophisticated automation, it's possible to <code>anticipate and
1260remediate failures</code> before they occur.<br><br>
1261
1262❆ **Scale horizontally to increase aggregate system availability**:<br>
12631. Replace one large resource with multiple small resources to reduce
1264the impact of a single failure on the overall system.
12652. `Distribute requests across multiple, smaller resources` to ensure that
1266they don't share a common point of failure.<br><br>
1267
1268❆ **Stop guessing capacity**:<br>
12691. A common cause of failure in on-premises systems is <code>resource 
1270saturation</code>, when the demands placed on a system exceed the capacity of
1271that system (this is often the objective of denial of service attacks).
12722. In the cloud, you can `monitor demand and system utilization`, and
1273`automate the addition or removal of resources` to maintain the optimal
1274level to satisfy demand without over- or under-provisioning.<br><br>
1275
1276❆ **Manage change in automation**:<br>
12771. Changes to your infrastructure should be done using automation.
12782. The changes that need to be managed are changes to the automation.<br><br>
1279
1280#### ? B. Definition
1281
1282There are `three best practice areas for reliability in the cloud`:<br>
1283
1284⌖ **Foundations**<br>
1285
1286⌖ **Change Management**<br>
1287
1288⌖ **Failure Management**<br><br>
1289
1290☛ To achieve reliability, a system `must have a well-planned foundation 
1291and monitoring in place`, with mechanisms for handling changes in demand 
1292or requirements.<br>
1293
1294☛ The system should be `designed to detect failure and automatically heal
1295itself.`<br><br>
1296
1297#### ? C. Best Practices
1298
1299#### ❶ *Foundations*
1300
1301`Before architecting any system, foundational requirements that influence
1302reliability should be in place.`<br>
1303
1304**For example**, you must have sufficient network bandwidth to your data
1305center.<br>
1306
1307? These **requirements are sometimes neglected** (because they are beyond a 
1308single project's scope).<br>
1309This neglect can have a `significant impact on the ability to deliver a 
1310reliable system.`<br>
1311
1312? In an on-premises environment, these requirements `can cause long lead
1313times` due to dependencies and therefore **must be incorporated during
1314initial planning**.<br><br>
1315
1316With AWS, most of these foundational requirements are already
1317incorporated or may be addressed as needed.<br>
1318
1319? The cloud is designed to be essentially limitless, so it is the
1320**responsibility of AWS to satisfy the requirement** for sufficient
1321networking and compute capacity, while **you are free to change resource
1322size and allocation**, such as the size of storage devices, on demand.<br><br>
1323
1324The following questions focus on these considerations for reliability.<br>
1325For a list of reliability questions, answers, and best practices, see 
1326the **Appendix**.)<br><br>
1327
1328? **REL 1**: **How do you manage service limits?**<br>
1329
1330> Default service limits exist to prevent accidental provisioning of
1331more resources than you need. There are also limits on how often you can
1332call API operations to protect services from abuse. If you are using **AWS
1333Direct Connect**, you have limits on the amount of data you can transfer 
1334on each connection. If you are using **AWS Marketplace applications**, you
1335need to understand the limitations of the applications. If you are using
1336third-party web services or software as a service, you also need to be
1337aware of the limits of those services. [(hello)]()<br>
1338<br>
1339
1340? **REL 2**: **How do you manage your network topology?**<br>
1341
1342> Applications can exist in one or more environments: your existing data
1343center infrastructure, publicly accessible public cloud infrastructure, 
1344or private addressed public cloud infrastructure. Network considerations
1345such as intra- and inter-system connectivity, public IP address
1346management, private address management, and name resolution are 
1347fundamental to using resources in the cloud. [(hello)]()<br>
1348<br>
1349
1350? AWS sets **service limits** (an upper limit on the number of each resource
1351your team can request) to protect you from accidentally 
1352over-provisioning resources.<br>
1353
1354? You will need to have **governance and processes in place** to `monitor and
1355change these limits to meet your business needs.`<br>
1356
1357? As you adopt the cloud, you may need to plan integration with existing
1358on-premises resources (**a hybrid approach**).<br>
1359
1360A hybrid model enables the gradual transition to an all-in-cloud 
1361approach over time.<br>
1362
1363Therefore, it's important to have a design for how your AWS and 
1364on-premises resources will interact as a network topology.<br><br>
1365
1366#### ❷ *Change Management*
1367
1368? Being aware of how change affects a system allows you to **plan 
1369proactively**, and **monitoring** allows you to `quickly identify trends` that
1370could lead to capacity issues or SLA breaches.<br>
1371
1372? In traditional environments, **change-control processes are often manual**
1373and `must be carefully coordinated with auditing` to effectively control
1374who makes changes and when they are made.<br><br>
1375
1376? Using AWS, you can **monitor the behavior of a system and automate the
1377response to KPIs**, for example, by adding additional servers as a system
1378gains more users.<br>
1379
1380You can control who has permission to make system changes and audit the
1381history of these changes.<br><br>
1382
1383The following questions focus on these considerations for reliability.<br><br>
1384
1385? **REL 3**: **How does your system adapt to changes in demand?**<br>
1386
1387> A scalable system provides elasticity to add and remove resources
1388automatically so that they closely match the current demand at any given
1389point in time. [(hello)]()<br>
1390<br>
1391
1392? **REL 4**: **How do you monitor your resources?**<br>
1393
1394> Logs and metrics are a powerful tool to gain insight into the health
1395of your workloads. You can configure your workload to monitor logs and
1396metrics & send notifications when thresholds are crossed or significant
1397events occur. Ideally, when low-performance thresholds are crossed or
1398failures occur, the workload has been architected to automatically
1399self-heal or scale in response. [(hello)]()<br>
1400<br>
1401
1402? **REL 5**: **How do you implement change?**<br>
1403
1404> Uncontrolled changes to your environment make it difficult to predict
1405the effect of a change. Controlled changes to provisioned resources and
1406workloads are necessary to ensure that the workloads and the operating
1407environment are running known software and can be patched or replaced in
1408a predictable manner. [(hello)]()<br>
1409<br>
1410
1411? When you architect a system to automatically add and remove resources 
1412in response to changes in demand, this not only **increases 
1413reliability** but also **ensures that business success doesn't become a 
1414burden**.<br>
1415
1416With monitoring in place, your team will be `automatically alerted when
1417KPIs deviate` from expected norms.<br>
1418
1419? **Automatic logging of changes** to your environment allows you to audit and
1420quickly identify actions that might have impacted reliability.<br>
1421
1422Controls on change management ensure that you can `enforce the rules
1423that deliver the reliability you need.`<br><br>
1424
1425#### ❸ *Failure Management*
1426
1427In any system of reasonable complexity it is expected that failures will
1428occur.<br>
1429
1430? It is **generally of interest to know** `how to become aware of these 
1431failures, respond to them, and prevent them from happening again.`<br><br>
1432
1433**With AWS**, you can `take advantage of automation` to react to monitoring
1434data.<br>
1435
1436**For example**, when a particular metric crosses a threshold, you can
1437trigger an automated action to remedy the problem.<br>
1438
1439? Also, `rather than trying to diagnose and fix a failed resource` that is
1440part of your production environment, `you can replace it with a new one` 
1441and carry out the analysis on the failed resource out of band.<br>
1442
1443? Since the cloud enables you to stand up temporary versions of a whole
1444system at low cost, you can **use automated testing to verify full
1445recovery processes**.<br><br>
1446
1447The following questions focus on these considerations for reliability.<br><br>
1448
1449? **REL 6**: **How do you back up data?**<br>
1450
1451> Back up data, applications, and operating environments (defined as
1452operating systems configured with applications) to meet requirements
1453for mean time to recovery (MTTR) and recovery point objectives (RPO). 
1454[(hello)]()<br>
1455<br>
1456
1457? **REL 7**: **How does your system withstand component failures?**<br>
1458
1459> If your workloads have a requirement, implicit or explicit, for high
1460availability and low mean time to recovery (MTTR), architect your
1461workloads for resilience and distribute your workloads to withstand
1462outages. [(hello)]()<br>
1463<br>
1464
1465? **REL 8**: **How do you test resilience?**<br>
1466
1467> Test the resilience of your workload to help you find latent bugs that
1468only surface in production. Exercise these tests regularly. [(hello)]()<br>
1469<br>
1470
1471? **REL 9**: **How do you plan for disaster recovery?**<br>
1472
1473> Disaster recovery (DR) is critical should restoration of data be 
1474required from backup methods. Your definition of and execution on the
1475objectives, resources, locations, and functions of this data must align
1476with RTO and RPO objectives. [(hello)]()<br>
1477<br>
1478
1479? **Regularly back up your data and test your backup files** to ensure you can
1480recover from both logical and physical errors.<br>
1481
1482? A **key to managing failure** is the frequent and automated testing of 
1483systems to cause failure, and then observe how they recover.<br>
1484
1485Do this on a regular schedule and ensure that such testing is also
1486triggered after significant system changes.<br>
1487
1488? **Actively track KPIs**, such as the recovery time objective (RTO) and 
1489recovery point objective (RPO), `to assess a system's resiliency` 
1490(especially under failure-testing scenarios).<br>
1491
1492Tracking KPIs will help you identify and mitigate single points of
1493failure.<br>
1494
1495? The objective is to **thoroughly test your system-recovery processes** so
1496that you are confident that you can recover all your data and continue
1497to serve your customers, even in the face of sustained problems.<br>
1498
1499Your recovery processes should be `as well exercised as your normal
1500production processes.`<br><br>
1501
1502#### ? D. Key AWS Services
1503
1504The AWS service that is essential to Reliability is **Amazon CloudWatch**, 
1505which `monitors runtime metrics.`<br>
1506
1507The following services and features support the three areas in
1508reliability:<br>
1509
1510* **Foundations**:<br>
15111. **AWS IAM** enables you to securely control access to AWS services and
1512resources.
15132. **Amazon VPC** lets you provision a private, isolated section of the AWS
1514Cloud where you can launch AWS resources in a virtual network.
15153. **AWS Trusted Advisor** provides visibility into service limits.
15164. **AWS Shield** is a managed Distributed Denial of Service (DDoS)  
1517protection service that safeguards web applications running on AWS.<br><br>
1518
1519* **Change Management**:<br>
15201. **AWS CloudTrail** records AWS API calls for your account and delivers
1521log files to you for auditing.
15222. **AWS Config** provides a detailed inventory of your AWS resources and
1523configuration, and continuously records configuration changes.
15243. **Amazon Auto Scaling** is a service that will provide an automated
1525demand management for a deployed workload.
15264. **Amazon CloudWatch** provides the ability to alert on metrics, including
1527custom metrics.
15285. Amazon CloudWatch `also has a logging feature` that can be used to
1529aggregate log files from your resources.<br><br>
1530
1531* **Failure Management**:<br>
15321. **AWS CloudFormation** provides templates for the creation of AWS
1533resources and provisions them in an orderly and predictable fashion.
15342. **Amazon S3** provides a highly durable service to keep backups.
15353. **Amazon Glacier** provides highly durable archives.
15364. **AWS KMS** provides a reliable key management system that integrates 
1537with many AWS services.<br><br>
1538
1539#### ? E. Resources
1540
1541Refer to the following resources to learn more about our best practices
1542for Reliability.<br>
1543
1544**Documentation**<br>
1545
1546❁ [Service Limits](http://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html?ref=wellarchitected-wp)<br>
1547❁ [Service Limits Reports Blog](http://aws.amazon.com/about-aws/whats-new/2014/06/19/amazon-ec2-service-limits-report-now-available/?ref=wellarchitected-wp)<br>
1548❁ [Amazon Virtual Private Cloud](http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Introduction.html?ref=wellarchitected-wp)<br>
1549❁ [AWS Shield](http://docs.aws.amazon.com/waf/latest/developerguide/shield-chapter.html?ref=wellarchitected-wp)<br>
1550❁ [Amazon CloudWatch](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html?ref=wellarchitected-wp)<br>
1551❁ [Amazon S3](http://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html?ref=wellarchitected-wp)<br>
1552❁ [AWS KMS](http://docs.aws.amazon.com/kms/latest/developerguide/overview.html?ref=wellarchitected-wp)<br><br>
1553
1554**Whitepaper**<br>
1555
1556❁ [Reliability Pillar](https://d0.awsstatic.com/whitepapers/architecture/AWS-Reliability-Pillar.pdf?ref=wellarchitected-wp)<br>
1557❁ [Backup Archive and Restore Approach Using AWS](http://d0.awsstatic.com/whitepapers/Backup_Archive_and_Restore_Approaches_Using_AWS.pdf?ref=wellarchitected-wp)<br>
1558❁ [Managing your AWS Infrastructure at Scale](http://d0.awsstatic.com/whitepapers/managing-your-aws-infrastructure-at-scale.pdf?ref=wellarchitected-wp)<br>
1559❁ [AWS Disaster Recovery](http://media.amazonwebservices.com/AWS_Disaster_Recovery.pdf?ref=wellarchitected-wp)<br>
1560❁ [AWS Amazon VPC Connectivity Options](http://media.amazonwebservices.com/AWS_Amazon_VPC_Connectivity_Options.pdf?ref=wellarchitected-wp)<br><br>
1561
1562**Video**<br>
1563
1564❁ [How do I manage my AWS service limits?](https://aws.amazon.com/premiumsupport/knowledge-center/manage-service-limits/?ref=wellarchitected-wp)<br>
1565❁ [Embracing Failure: Fault-Injection and Service Reliability](https://www.youtube.com/watch?v=wrY7XoOnysg&ref=wellarchitected-wp)<br><br>
1566
1567**Product**<br>
1568
1569❁ [AWS Premium Support](https://aws.amazon.com/premiumsupport/?ref=wellarchitected-wp)<br>
1570❁ [Trusted Advisor](https://aws.amazon.com/premiumsupport/trustedadvisor/?ref=wellarchitected-wp)<br><br>
1571
1572### ? Performance Efficiency
1573
1574The **Performance Efficiency** pillar includes `the ability to use
1575computing resources efficiently to meet system requirements, and to
1576maintain that efficiency as demand changes and technologies evolve.`<br><br>
1577
1578The performance efficiency pillar provides an overview of design
1579principles, best practices, and questions.<br>
1580
1581You can find prescriptive guidance on implementation in the
1582[Performance Efficiency Pillar whitepaper](https://d0.awsstatic.com/whitepapers/architecture/AWS-Performance-Efficiency-Pillar.pdf?ref=wellarchitected-wp).<br><br>
1583
1584#### ? A. Design Principles
1585
1586There are `five design principles` for performance efficiency in the cloud:<br>
1587
1588❆ **Democratize advanced technologies**:<br>
15891. Technologies that are difficult to implement can become easier to
1590consume by pushing that knowledge and complexity into the cloud vendor's 
1591domain.
15922. Rather than having your IT team learn how to host and run a new
1593technology, they can simply consume it as a service.
15943. For example, NoSQL databases, media transcoding, and machine learning
1595are all technologies that require expertise that is not evenly
1596dispersed across the technical community.
15974. In the cloud, these technologies become services that your team can
1598consume while focusing on product development rather than resource
1599provisioning and management.<br><br>
1600
1601❆ **Go global in minutes**:<br>
16021. Easily deploy your system in multiple Regions around the world with
1603just a few clicks.
16042. This allows you to provide lower latency and a better experience for
1605your customers at minimal cost.<br><br>
1606
1607❆ **Use serverless architectures**:<br>
16081. In the cloud, serverless architectures remove the need for you to run
1609and maintain servers to carry out traditional compute activities.
16102. For example, storage services can act as static websites, removing
1611the need for web servers, and event services can host your code for you.
16123. This not only removes the operational burden of managing these
1613servers, but can also lower transactional costs because these managed
1614services operate at cloud scale.<br><br>
1615
1616❆ **Experiment more often**:<br>
16171. With virtual and automatable resources, you can quickly carry out
1618comparative testing using different types of instances, storage, or
1619configurations.<br><br>
1620
1621❆ **Mechanical sympathy**:<br>
16221. Use the technology approach that aligns best to what you are trying 
1623to achieve.
16242. For example, consider data access patterns when selecting database or
1625storage approaches.<br><br>
1626
1627#### ? B. Definition
1628
1629There are `four best practice areas for performancy efficiency in the
1630cloud`:<br>
1631
1632⌖ **Selection**<br>
1633
1634⌖ **Review**<br>
1635
1636⌖ **Monitoring**<br>
1637
1638⌖ **Tradeoffs**<br><br>
1639
1640Take a data-driven approach to selecting a high-performance architecture.<br>
1641
1642Gather data on all aspects of the architecture, from the high-level
1643design to the selection and configuration of resource types.<br>
1644
1645By reviewing your choices on a cyclical basis, you will ensure that you
1646are taking advantage of the continually evolving AWS Cloud.<br>
1647
1648Monitoring will ensure that you are aware of any deviance from expected
1649performance and can take action on it.<br>
1650
1651Finally, your architecture can make tradeoffs to improve performance, 
1652such as using compression or caching, or relaxing consistency
1653requirements.<br><br>
1654
1655#### ? C. Best Practices
1656
1657#### ❶ *Selection*
1658
1659The optimal solution for a particular system will vary based on the kind
1660of workload you have, often with multiple approaches combined.<br>
1661
1662Well-architected systems use multiple solutions and enable different
1663features to improve performance.<br><br>
1664
1665In AWS, resources are virtualized and are available in a number of
1666different types and configurations.<br>
1667
1668This makes it easier to find an approach that closely matches your 
1669needs, and you can also find options that are not easily achievable with
1670on-premises infrastructure.<br>
1671
1672For example, a managed service such as Amazon DynamoDB provides a fully
1673managed NoSQL database with single-digit millisecond latency at any
1674scale.<br><br>
1675
1676The following questions focus on these considerations for performance
1677efficiency.<br>
1678(For a list of performance efficiency questions, answers, and best
1679practices, see the **Appendix**.)<br><br>
1680
1681? **PERF 1**: **How do you select the best performing architecture?**<br>
1682
1683> Often, multiple approaches are required to get optimal performance
1684across a workload. Well-architected systems use multiple solutions and
1685enable different features to improve performance. [(hello)]()<br>
1686<br>
1687
1688When you select the patterns and implementation for your architecture, 
1689use a data-driven approach for the most optimal solution.<br>
1690
1691AWS Solutions Architects, AWS Reference Architectures, and AWS Partner
1692Network (APN) Partners can help you select an architecture based on what
1693we have learned, but data obtained through benchmarking or load testing
1694will be required to optimize your architecture.<br><br>
1695
1696Your architecture will likely combine a number of different 
1697architectural approaches (for example, event-driven, ETL, or pipeline).<br>
1698
1699The implementation of your architecture will use the AWS services that
1700are specific to the optimization of your architecture's performance.<br>
1701
1702In the following sections, we look at the four main resource types that
1703you should consider (compute, storage, database, and network).<br><br>
1704
1705#### ? A. Compute
1706
1707The optimal compute solution for a particular system may vary based on
1708application design, usage patterns, and configuration settings.<br>
1709
1710Architectures may use different compute solutions for various components
1711and enable different features to improve performance.<br>
1712
1713Selecting the wrong compute solution for an architecture can lead to
1714lower performance efficiency.<br><br>
1715
1716**In AWS**, `compute is available in three forms`: instances, containers, and
1717functions:<br>
1718
1719* **Instances** are virtualized servers and, therefore, you can change
1720their capabilities with the click of a button or an API call.
1721
17221. Because in the cloud resource decisions are no longer fixed, you can
1723experiment with different server types.
17242. At AWS, these virtual server instances come in different families and
1725sizes, & they offer a wide variety of capabilities, including
1726solid-state drives (SSDs) and graphics processing units (GPUs).<br><br>
1727
1728* **Containers** are a method of operating system virtualization that
1729allow you to run an application and its dependencies in 
1730resource-isolated processes.<br><br>
1731
1732* **Functions** abstract the execution environment from the code you
1733want to execute. For example, AWS Lambda allows you to execute code
1734without running an instance.<br><br>
1735
1736The following questions focus on these considerations for performance
1737efficiency.<br><br>
1738
1739? **PERF 2**: **How do you select your compute solution?**<br>
1740
1741> The optimal compute solution for a system varies based on application
1742design, usage patterns, and configuration settings. Architectures may
1743use different compute solutions for various components and enable
1744different features to improve performance. Selecting the wrong compute
1745solution for an architecture can lead to lower performance efficiency.
1746[(hello)]()<br>
1747<br>
1748
1749When architecting, your use of compute should take advantage of the
1750elasticity mechanisms available to ensure you have sufficient capacity 
1751to sustain performance as demand changes.<br><br>
1752
1753#### ? B. Storage
1754
1755The optimal storage solution for a particular system will vary based on
1756the 
17571. Kind of access method (block, file, or object), 
17582. Patterns of access (random or sequential), 
17593. Throughput required, 
17604. Frequency of access (online, offline, archival), 
17615. Frequency of update (WORM, dynamic), and
17626. Availability and durability constraints.<br>
1763
1764Well-architected systems use multiple storage solutions and enable
1765different features to improve performance.<br><br>
1766
1767In AWS, storage is virtualized and is available in a number of different
1768types.<br>
1769
1770This makes it easier to match your storage methods more closely with
1771your needs, and also offers storage options that are not easily
1772achievable with on-premises infrastructure.<br>
1773
1774For example, Amazon S3 is designed for 11 nines of durability.<br>
1775
1776You can also change from using magnetic hard disk drives (HDDs) to SSDs, 
1777and easily move virtual drives from one instance to another in seconds.<br><br>
1778
1779The following questions focus on these considerations for performance
1780efficiency.<br><br>
1781
1782? **PERF 3**: **How do you select your storage solution?**<br>
1783
1784> The optimal storage solution for a system varies on the kind of access
1785method (block, file, or object), patterns of access (random or 
1786sequential), required throughput, frequency of access (online, offline, 
1787archival), frequency of update (WORM, dynamic), & availability and
1788durability constraints. Well-architected systems use multiple storage
1789solutions and enable different features to improve performance and use
1790resources efficiently. [(hello)]()<br>
1791<br>
1792
1793When you select a storage solution, ensuring that it aligns with your
1794access patterns will be critical to achieving the performance you want.<br><br>
1795
1796#### ? C. Database
1797
1798The optimal database solution for a particular system can vary based on
1799requirements for availability, consistency, partition tolerance, 
1800latency, durability, scalability, and query capability.<br>
1801
1802Many systems use different database solutions for various subsystems and
1803enable different features to improve performance.<br>
1804
1805Selecting the wrong database solution and features for a system can lead
1806to lower performance efficiency.<br><br>
1807
1808Amazon RDS provides a fully managed relational database.<br>
1809
1810With Amazon RDS, you can scale your database's compute and storage
1811resources, often with no downtime.<br>
1812
1813Amazon DynamoDB is a fully managed NoSQL database that provides
1814single-digit millisecond latency at any scale.<br>
1815
1816Amazon Redshift is a managed petabyte-scale data warehouse that allows 
1817you to change the number or type of nodes as your performance or 
1818capacity needs change.<br><br>
1819
1820The following questions focus on these considerations for performance
1821efficiency.<br><br>
1822
1823? **PERF 4**: **How do you select your database solution?**<br>
1824
1825> The optimal database solution for a system varies based on
1826requirements for availability, consistency, partition tolerance, 
1827latency, durability, scalability, and query capability. Many systems use
1828different database solutions for various sub-systems and enable 
1829different features to improve performance. Selecting the wrong database
1830solution and features for a system can lead to lower performance efficiency. [(hello)]()<br>
1831<br>
1832
1833Although a workload's database approach (RDBMS, NoSQL) has significant
1834impact on performance efficiency, it is often an area that is chosen
1835according to organizational defaults rather than through a data-driven
1836approach.<br>
1837
1838As with storage, it is critical to consider the access patterns of your
1839workload, and also to consider if other non-database solutions could
1840solve the problem more efficiently (such as using a search engine or
1841data warehouse).<br><br>
1842
1843####  ? D. Network
1844
1845The optimal network solution for a particular system will vary based on
1846latency, throughput requirements and so on.<br>
1847
1848Physical constraints such as user or on-premises resources will drive
1849location options, which can be offset using edge techniques or resource
1850placement.<br><br>
1851
1852In AWS, networking is virtualized and is available in a number of 
1853different types and configurations.<br>
1854
1855This makes it easier to match your networking methods more closely with
1856your needs.<br>
1857
1858AWS offers product features (for example, Enhanced Networking, Amazon
1859EBS-optimized instances, Amazon S3 transfer acceleration, dynamic Amazon
1860CloudFront) to optimize network traffic.<br>
1861
1862AWS also offers networking features (for example, Amazon Route53 latency
1863routing, Amazon VPC endpoints, and AWS Direct Connect) to reduce network
1864distance or jitter.<br><br>
1865
1866The following questions focus on these considerations for performance
1867efficiency.<br><br>
1868
1869? **PERF 5**: **How do you configure your networking solution?**<br>
1870
1871> The optimal network solution for a system varies based on latency, 
1872throughput requirements, and so on. Physical constraints such as user or
1873on-premises resources drive location options, which can be offset using
1874edge techniques or resource placement. [(hello)]()<br>
1875<br>
1876
1877When selecting your network solution, you need to consider location.<br>
1878
1879With AWS, you can choose to place resources close to where they will be
1880used to reduce distance.<br>
1881
1882By taking advantage of Regions, placement groups, and edge locations you
1883can significantly improve performance.<br><br>
1884
1885#### ❷ *Review*
1886
1887When architecting solutions, there is a finite set of options that you
1888can choose from.<br>
1889
1890However, over time new technologies and approaches become available that
1891could improve the performance of your architecture.<br><br>
1892
1893Using AWS, you can take advantage of our continual innovation, which is
1894driven by customer need.<br>
1895
1896We release new Regions, edge locations, services, and features regularly.<br>
1897
1898Any of these could positively improve the performance efficiency of your
1899architecture.<br><br>
1900
1901The following questions focus on these considerations for performance
1902efficiency.<br><br>
1903
1904? **PERF 6**: **How do you evolve your workload to take advantage of
1905new releases?**<br>
1906
1907> When architecting workloads, there are finite options that you can
1908choose from. However, over time, new technologies and approaches become
1909available that could improve the performance of your workload. [(hello)]()<br>
1910<br>
1911
1912Understanding where your architecture is performance-constrained will
1913allow you to look out for releases that could alleviate that constraint.<br><br>
1914
1915#### ❸ *Monitoring*
1916
1917After you have implemented your architecture, you will need to monitor
1918its performance so that you can remediate any issues before your
1919customers are aware.<br>
1920
1921Monitoring metrics should be used to raise alarms when thresholds are
1922breached.<br>
1923
1924The alarm can trigger automated action to work around any badly
1925performing components.<br><br>
1926
1927Amazon CloudWatch provides the ability to monitor and send notification
1928alarms.<br>
1929
1930You can use automation to work around performance issues by triggering
1931actions through Amazon Kinesis, Amazon Simple Queue Service (Amazon SQS),
1932and AWS Lambda.<br><br>
1933
1934The following questions focus on these considerations for performance
1935efficiency.<br><br>
1936
1937? **PERF 7**: **How do you monitor your resources to ensure they are
1938performing as expected?**<br>
1939
1940> System performance can degrade over time. Monitor system performance
1941to identify this degradation and remediate internal or external factors, 
1942such as the operating system or application load. [(hello)]()<br>
1943<br>
1944
1945Ensuring that you do not see too many false positives, or are
1946overwhelmed with data, is key to having an effective monitoring solution.<br>
1947
1948Automated triggers avoid human error and can reduce the time to fix
1949problems.<br>
1950
1951Plan for game days where you can conduct simulations in the production
1952environment to test your alarm solution and ensure that it correctly
1953recognizes issues.<br><br>
1954
1955#### ❹ *Tradeoffs*
1956
1957When you architect solutions, think about tradeoffs so you can select an
1958optimal approach.<br>
1959
1960Depending on your situation, you could trade consistency, durability, 
1961and space versus time or latency to deliver higher performance.<br><br>
1962
1963Using AWS, you can go global in minutes and deploy resources in multiple
1964locations across the globe to be closer to your end users.<br>
1965
1966You can also dynamically add read-only replicas to information stores 
1967such as database systems to reduce the load on the primary database.<br>
1968
1969AWS also offers caching solutions such as Amazon ElastiCache, which
1970provides an in-memory data store or cache, and Amazon CloudFront, which
1971caches copies of your static content closer to end users.<br>
1972
1973Amazon DynamoDB Accelerator (DAX) provides a read-through/write-through
1974distributed caching tier in front of DynamoDB, supporting the same API,
1975but providing sub-millisecond latency for entities that are in the
1976cache.<br><br>
1977
1978The following questions focus on these considerations for performance
1979efficiency.<br><br>
1980
1981? **PERF 8**: **How do you use tradeoffs to improve performance?**<br>
1982
1983> When architecting solutions, actively considering tradeoffs enables
1984you to select an optimal approach. Often you can improve performance by
1985trading consistency, durability, and space for time and latency. [(hello)]()<br>
1986<br>
1987
1988Tradeoffs can increase the complexity of your architecture and require
1989load testing to ensure that a measurable benefit is obtained.<br><br>
1990
1991#### ? D. Key AWS Services
1992
1993The AWS service that is essential to Performance Efficiency is **Amazon
1994CloudWatch**, which `monitors your resources and systems, providing
1995visibility into your overall performance and operational health.`<br>
1996
1997The following services and features support the four areas in 
1998performance efficiency:
1999
2000* **Selection**:
2001
2002✱ **Compute**:<br>
20031. **Auto Scaling** is key to ensuring that you have enough instances to
2004meet demand and maintain responsiveness.<br>
2005
2006✱ **Storage**:<br>
20071. **Amazon EBS** provides a wide range of storage options (such as SSD 
2008and provisioned input/output operations per second (PIOPS)) that allow 
2009you to optimize for your use case.
20102. **Amazon S3** provides serverless content delivery, and **Amazon S3 
2011transfer acceleration** enables fast, easy, and secure transfers of 
2012files over long distances.<br>
2013
2014✱ **Database**:<br>
20151. **Amazon RDS** provides a wide range of database features (such as 
2016PIOPS and read replicas) that allow you to optimize for your use case.
20172. **Amazon DynamoDB** provides single-digit millisecond latency at any
2018scale.<br>
2019
2020✱ **Network**:<br>
20211. **Amazon Route53** provides latency-based routing.
20222. **Amazon VPC** endpoints and **AWS Direct Connect** can reduce 
2023network distance or jitter.<br><br>
2024
2025* **Review**:
20261. The **AWS Blog** and the **What's New section** on the AWS website 
2027are resources for learning about newly launched features and services.<br><br>
2028
2029* **Monitoring**:
20301. **Amazon CloudWatch** provides metrics, alarms, and notifications 
2031that you can integrate with your existing monitoring solution, and that you can use with **AWS Lambda** to trigger actions.<br><br>
2032
2033* **Tradeoffs**:
20341. **Amazon ElastiCache, Amazon CloudFront, and AWS Snowball** are 
2035services that allow you to improve performance.
20362. Read replicas in **Amazon RDS** can allow you to scale read-heavy workloads.<br><br>
2037
2038#### ? E. Resources
2039
2040Refer to the following resources to learn more about our best practices
2041for Performance Efficiency.<br>
2042
2043**Documentation**
2044
2045❁ [Amazon S3 Performance Optimization](http://docs.aws.amazon.com/AmazonS3/latest/dev/PerformanceOptimization.html?ref=wellarchitected-wp)<br>
2046❁ [Amazon EBS Volume Performance](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSPerformance.html?ref=wellarchitected-wp)<br><br>
2047
2048**Whitepaper**
2049
2050❁ [Performance Efficiency Pillar](https://d0.awsstatic.com/whitepapers/architecture/AWS-Performance-Efficiency-Pillar.pdf?ref=wellarchitected-wp)<br><br>
2051
2052**Video**
2053
2054❁ [AWS re:Invent 2016: Scaling Up to Your First 10 Million Users (ARC 201)](https://www.youtube.com/watch?v=n28lDDdlnVg&ref=wellarchitected-wp)<br>
2055❁ [AWS re:Invent 2017: Deep Dive on Amazon EC2 Instances](https://www.youtube.com/watch?v=mZy6E2I5Rek&ref=wellarchitected-wp)<br><br>
2056
2057### ⚖ Cost Optimization
2058
2059The **Cost Optimization** pillar includes `the ability to run systems to
2060deliver business value at the lowest price point.`<br><br>
2061
2062The cost optimization pillar provides an overview of design principles, 
2063best practices, and questions.<br>
2064
2065You can find prescriptive guidance on implementation in the
2066[Cost Optimization Pillar whitepaper](https://d0.awsstatic.com/whitepapers/architecture/AWS-Cost-Optimization-Pillar.pdf?ref=wellarchitected-wp).<br><br>
2067
2068#### ? A. Design Principles
2069
2070There are `five design principles` for cost optimization in the cloud:<br>
2071
2072❆ **Adopt a consumption model**:
20731. Pay only for the computing resources that you require and increase or
2074decrease usage depending on business requirements, not by using 
2075elaborate forecasting.
20762. For example, development and test environments are typically only
2077used for eight hours a day during the work week.
20783. You can stop these resources when they are not in use for a potential
2079cost savings of 75% (40 hours versus 168 hours).<br><br>
2080
2081❆ **Measure overall efficiency**:
20821. Measure the business output of the workload and the costs associated
2083with delivering it.
20842. Use this measure to know the gains you make from increasing output
2085and reducing costs.<br><br>
2086
2087❆ **Stop spending money on data center operations**:
20881. AWS does the heavy lifting of racking, stacking, and powering 
2089servers, so you can focus on your customers and organization projects
2090rather than on IT infrastructure.<br><br>
2091
2092❆ **Analyze and attribute expenditure**:
20931. The cloud makes it easier to accurately identify the usage and cost
2094of systems, which then allows transparent attribution of IT costs to
2095individual workload owners.
20962. This helps measure return on investment (ROI) and gives workload
2097owners an opportunity to optimize their resources and reduce costs.<br><br>
2098
2099❆ **Use managed and application level services to reduce cost of
2100ownership**:
21011. In the cloud, managed and application level services remove the
2102operational burden of maintaining servers for tasks such as sending 
2103email or managing databases.
21042. As managed services operate at cloud scale, they can offer a lower
2105cost per transaction or service.<br><br>
2106
2107#### ? B. Definition
2108
2109There are `four best practice areas for cost optimization in the cloud:`<br>
2110
2111⌖ **Expenditure Awareness**<br>
2112
2113⌖ **Cost-Effective Resources**<br>
2114
2115⌖ **Matching supply and demand**<br>
2116
2117⌖ **Optimizing Over Time**<br><br>
2118
2119As with the other pillars, there are tradeoffs to consider. For example, 
2120do you want to prioritize for speed to market or for cost?<br>
2121
2122In some cases, it's best to prioritize for speed — going to market
2123quickly, shipping new features, or simply meeting a deadline — rather
2124than investing in upfront cost optimization.<br>
2125
2126Design decisions are sometimes guided by haste as opposed to empirical
2127data, as the temptation always exists to overcompensate "just in case"
2128rather than spend time benchmarking for the most cost-optimal workload
2129over time.<br>
2130
2131This often leads to drastically over-provisioned and under-optimized
2132deployments, which remain static throughout their life cycle.<br>
2133
2134The following sections provide techniques and strategic guidance for the
2135initial and ongoing cost optimization of your deployment.<br><br>
2136
2137#### ? C. Best Practices
2138
2139#### ❶ *Expenditure Awareness*
2140
2141The increased flexibility and agility that the cloud enables encourages
2142innovation and fast-paced development and deployment.<br>
2143
2144It eliminates the manual processes and time associated with provisioning
2145on-premises infrastructure, including identifying hardware 
2146specifications, negotiating price quotations, managing purchase orders,
2147scheduling shipments, and then deploying the resources.<br>
2148
2149However, the ease of use and virtually unlimited on-demand capacity
2150requires a new way of thinking about expenditures.<br><br>
2151
2152Many businesses are composed of multiple systems run by various teams.<br>
2153
2154The capability to attribute resource costs to the individual 
2155organization or product owners drives efficient usage behavior and helps
2156reduce waste.<br>
2157
2158Accurate cost attribution allows you to know which products are truly
2159profitable, and allows you to make more informed decisions about where
2160to allocate budget.<br><br>
2161
2162In AWS you can use Cost Explorer to track your spend, and gain insights
2163into exactly where you spend.<br>
2164
2165Using AWS Budgets, you can send notifications if your usage or costs are
2166not inline with your forecasts.<br>
2167
2168You can use tagging on resources to apply business and organization
2169information to your usage and cost; this provides additional insights to
2170optimization from an organization perspective.<br><br>
2171
2172The following questions focus on these considerations for cost
2173optimization.<br>
2174(For a list of cost optimization questions, answers, and best practices, 
2175see the **Appendix**.)<br><br>
2176
2177? **COST 1**: **How do you govern usage?**<br>
2178
2179> Establish policies and mechanisms to ensure that appropriate costs are
2180incurred while objectives are achieved. By employing a 
2181checks-and-balances approach, you can innovate without overspending. [(hello)]()<br>
2182<br>
2183
2184? **COST 2**: **How do you monitor usage and cost?**<br>
2185
2186> Establish policies and procedures to monitor and appropriately
2187allocate your costs. This allows you to measure and improve the cost
2188efficiency of this workload. [(hello)]()<br>
2189<br>
2190
2191? **COST 3**: **How do you decommission resources?**<br>
2192
2193> Implement change control and resource management from project
2194inception to end-of-life. This ensures you shut down or terminate unused
2195resources to reduce waste. [(hello)]()<br>
2196<br>
2197
2198You can use cost allocation tags to categorize and track your AWS usage
2199and costs.<br>
2200
2201When you apply tags to your AWS resources (such as EC2 instances or S3 
2202buckets), AWS generates a cost and usage report with your usage and your
2203tags.<br>
2204
2205You can apply tags that represent organization categories (such as cost
2206centers, workload names, or owners) to organize your costs across
2207multiple services.<br><br>
2208
2209Combining tagged resources with entity lifecycle tracking (employees, 
2210projects) makes it possible to identify orphaned resources or projects
2211that are no longer generating value to the organization and should be
2212decommissioned.<br>
2213
2214You can set up billing alerts to notify you of predicted overspending, 
2215and the AWS Simple Monthly Calculator allows you to calculate your data
2216transfer costs.<br><br>
2217
2218#### ❷ *Cost-Effective Resources*
2219
2220Using the appropriate instances and resources for your workload is key
2221to cost savings.<br>
2222
2223For example, a reporting process might take five hours to run on a 
2224smaller server but one hour to run on a larger server that is twice as
2225expensive.<br>
2226
2227Both servers give you the same outcome, but the smaller server incurs
2228more cost over time.<br><br>
2229
2230A well-architected workload uses the most cost-effective resources,
2231which can have a significant and positive economic impact.<br>
2232
2233You also have the opportunity to use managed services to reduce costs.<br>
2234
2235For example, rather than maintaining servers to deliver email, you can
2236use a service that charges on a per-message basis.<br><br>
2237
2238AWS offers a variety of flexible and cost-effective pricing options to
2239acquire instances from EC2 and other services in a way that best fits
2240your needs.<br>
2241
2242*On-Demand Instances* allow you to pay for compute capacity by the hour,
2243with no minimum commitments required.<br>
2244
2245*Reserved Instances* allow you to reserve capacity and offer savings of
2246up to 75% off On-Demand pricing.<br>
2247
2248With *Spot Instances*, you can leverage unused Amazon EC2 capacity and
2249offer savings of up to 90% off On-Demand pricing.<br>
2250
2251Spot Instances are appropriate where the system can tolerate using a 
2252fleet of servers where individual servers can come and go dynamically,
2253such as stateless web servers, batch processing, or when using HPC and
2254big data.<br><br>
2255
2256Appropriate service selection can also reduce usage and costs; such as
2257CloudFront to minimize data transfer, or completely eliminate costs, 
2258such as utilizing Amazon Aurora on RDS to remove expensive database
2259licensing costs.<br><br>
2260
2261The following questions focus on these considerations for cost
2262optimization.<br><br>
2263
2264? **COST 4**: **How do you evaluate cost when you select services?**<br>
2265
2266> Amazon EC2, Amazon EBS, and Amazon S3 are building-block AWS services.
2267Managed services, such as Amazon RDS and Amazon DynamoDB, are higher
2268level, or application level, AWS services. By selecting the appropriate
2269building blocks and managed services, you can optimize this workload for
2270cost. For example, using managed services, you can reduce or remove much
2271of your administrative and operational overhead, freeing you to work on
2272applications and business-related activities. [(hello)]()<br>
2273<br>
2274
2275? **COST 5**: **How do you meet cost targets when you select resource
2276type and size?**<br>
2277
2278> Ensure that you choose the appropriate resource size for the task at
2279hand. By selecting the most cost effective type and size, you minimize
2280waste. [(hello)]()<br>
2281<br>
2282
2283? **COST 6**: **How do you use pricing models to reduce cost?**<br>
2284
2285> Use the pricing model that is most appropriate for your resources to
2286minimize expense. [(hello)]()<br>
2287<br>
2288
2289? **COST 7**: **How do you plan for data transfer charges?**<br>
2290
2291> Ensure that you plan and monitor data transfer charges so that you can
2292make architectural decisions to minimize costs. A small yet effective
2293architectural change can drastically reduce your operational costs over
2294time. [(hello)]()<br>
2295<br>
2296
2297By factoring in cost during service selection, and using tools such as
2298**Cost Explorer** and **AWS Trusted Advisor** to regularly review your 
2299AWS usage, you can actively monitor your utilization and adjust your
2300deployments accordingly.<br><br>
2301
2302#### ❸ *Matching supply and demand*
2303
2304Optimally matching supply to demand delivers the lowest cost for a 
2305workload, but there also needs to be sufficient extra supply to allow
2306for provisioning time and individual resource failures.<br>
2307
2308Demand can be fixed or variable, requiring metrics and automation to
2309ensure that management does not become a significant cost.<br><br>
2310
2311In AWS, you can automatically provision resources to match demand.<br>
2312
2313Auto Scaling and demand, buffer, & time-based approaches allow you to
2314add and remove resources as needed.<br>
2315
2316If you can anticipate changes in demand, you can save more money and
2317ensure your resources match your workload needs.<br><br>
2318
2319The following questions focus on these considerations for cost
2320optimization.<br><br>
2321
2322? **COST 8**: **How do you match supply of resources with demand?**<br>
2323
2324> For a workload that has balanced spend and performance, ensure that
2325everything you pay for is used and avoid significantly underutilizing
2326instances. A skewed utilization metric in either direction has an
2327adverse impact on your organization, in either operational costs 
2328(degraded performance due to over-utilization), or wasted AWS 
2329expenditures (due to over-provisioning). [(hello)]()<br>
2330<br>
2331
2332When designing to match supply against demand, actively think about the
2333patterns of usage and the time it takes to provision new resources.<br><br>
2334
2335#### ❹ *Optimizing Over Time*
2336
2337As AWS releases new services and features, it is a best practice to
2338review your existing architectural decisions to ensure they continue to
2339be the most cost-effective.<br>
2340
2341As your requirements change, be aggressive in decommissioning resources,
2342entire services, and systems that you no longer require.<br><br>
2343
2344Managed services from AWS can significantly optimize the workload, so it
2345is essential to be aware of new managed services and features as they
2346become available.<br>
2347
2348For example, running an Amazon RDS database can be cheaper than running
2349your own database on Amazon EC2.<br><br>
2350
2351The following questions focus on these considerations for cost
2352optimization.<br><br>
2353
2354? **COST 9**: **How do you evaluate new services?**<br>
2355
2356> As AWS releases new services and features, it is a best practice to
2357review your existing architectural decisions to ensure they continue to
2358be the most cost-effective. [(hello)]()<br>
2359<br>
2360
2361When regularly reviewing your deployments, assess how newer services can
2362help save you money.<br>
2363
2364For example, Amazon Aurora on RDS can reduce costs for relational 
2365databases.<br><br>
2366
2367#### ? D. Key AWS Services
2368
2369The tool that is essential to Cost Optimization is **Cost Explorer**,
2370which `helps you gain visibility and insights into your usage, across
2371your workloads and throughout your organization.`<br>
2372
2373The following services and features support the four areas in cost
2374optimization:<br>
2375
2376* **Expenditure Awareness**:
23771. **AWS Cost Explorer** allows you to view and track your usage in
2378detail.
23792. **AWS Budgets** notify you if your usage or spend exceeds actual or
2380forecast budgeted amounts.<br><br>
2381
2382* **Cost-Effective Resources**:
23831. You can use **Cost Explorer** for Reserved Instance recommendations, 
2384and see patterns in how much you spend on AWS resources over time.
23852. Use **Amazon CloudWatch and Trusted Advisor** to help right size your
2386resources.
23873. You can use **Amazon Aurora on RDS** to remove database licensing costs.
23884. **AWS Direct Connect and Amazon CloudFront** can be used to optimize
2389data transfer.<br><br>
2390
2391* **Matching supply and demand**:
23921. **Auto Scaling** allows you to add or remove resources to match
2393demand without overspending.<br><br>
2394
2395* **Optimizing Over Time**:
23961. The **AWS News Blog** and the **What's New section** on the AWS
2397website are resources for learning about newly launched features and
2398services.
23992. **AWS Trusted Advisor** inspects your AWS environment and finds
2400opportunities to save you money by eliminating unused or idle resources
2401or committing to Reserved Instance capacity.<br><br>
2402
2403#### ? E. Resources
2404
2405Refer to the following resources to learn more about our best practices
2406for Cost Optimization.<br>
2407
2408**Documentation**<br>
2409
2410❁ [Analyzing Your Costs with Cost Explorer](http://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-explorer-what-is.html?ref=wellarchitected-wp)<br>
2411❁ [AWS Cloud Economics Center](https://aws.amazon.com/economics/?ref=wellarchitected-wp)<br>
2412❁ [AWS Detailed Billing Reports](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/billing-reports-costusage.html?ref=wellarchitected-wp)<br><br>
2413
2414**Whitepaper**<br>
2415
2416❁ [Cost Optimization Pillar](https://d0.awsstatic.com/whitepapers/architecture/AWS-Cost-Optimization-Pillar.pdf?ref=wellarchitected-wp)<br><br>
2417
2418**Video**<br>
2419
2420❁ [Cost Optimization on AWS](https://www.youtube.com/watch?v=XQFweGjK_-o&ref=wellarchitected-wp)<br><br>
2421
2422**Tool**<br>
2423
2424❁ [AWS Total Cost of Ownership (TCO) Calculators](http://aws.amazon.com/tco-calculator?ref=wellarchitected-wp)<br>
2425❁ [AWS Simple Monthly Calculator](http://calculator.s3.amazonaws.com/index.html?ref=wellarchitected-wp)<br><br>