CHAPTER 6
Database administration is a business function that involves managing a database in order to maximize its value to an organization. The conflicting goals of protecting the database and maximizing its availability and benefit to users must be balanced using good administration. All databases need database administration. The database administration for small, personal databases is informal; database administration for large, multiuser databases can involve an office and many people. DBA can stand for database administration or database administrator. Three basic database administration functions are necessary: concurrency control, security, and backup and recovery. The goal of concurrency control is to ensure that one user’s work does not inappropriately influence another user’s work. No single concurrency control technique is ideal for all circumstances. Trade-offs need to be made between the level of protection and data throughput. A transaction, or logical unit of work, is a series of actions taken against a database that occur as an atomic unit; either all of them occur or none of them do. The activity of concurrent transactions is interleaved on the database server. In some cases, updates can be lost if concurrent transactions are not controlled. Another concurrency problem concerns inconsistent reads. A dirty read occurs when one transaction reads a changed record that has not been committed to the database. A nonrepeatable read occurs when one transaction rereads data it has previously read and finds modifications or deletions caused by another transaction. A phantom read occurs when a transaction rereads data and finds new rows that were inserted by a different transaction. To avoid concurrency problems, database elements are locked. Implicit locks are placed by the DBMS; explicit locks are issued by the application program. The size of a locked resource is called lock granularity.
An exclusive lock prohibits other users from reading or updating the locked resource; a shared lock allows other users to read the locked resource but not to update it. Two transactions that run concurrently and generate results that are consistent with the results that would have occurred if the transactions had run separately are referred to as serializable transactions. Two-phase locking, in which locks are acquired in a growing phase and released in a shrinking phase, is one scheme for serializability. A special case of two-phase locking is to acquire locks throughout the transaction but not to free any lock until the transaction is finished. Deadlock, or the deadly embrace, occurs when two transactions are each waiting on a resource that the other transaction holds. Deadlock can be prevented by requiring transactions to acquire all locks at the same time. When deadlock occurs, the only way to cure it is to abort one of the transactions and back out of partially completed work. Optimistic locking assumes that no transaction conflict will occur and then deals with the consequences if it does. Pessimistic locking assumes that conflict will occur and so prevents it ahead of time with locks. In general, optimistic locking is preferred for the Internet and for many intranet applications. Most application programs do not explicitly declare locks. Instead, they mark transaction boundaries with SQL transaction control statements—such as BEGIN, COMMIT, and ROLLBACK statements—and declare the concurrent behavior they want. The DBMS then places locks for the application that will result in the desired behavior. An ACID transaction is one that is atomic, consistent, isolated, and durable. Durable means that database changes are permanent. Consistency can refer to either statement-level or transaction-level consistency. With transaction-level consistency, a transaction may not see its own changes. 
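The transaction boundaries described above can be sketched with Python's built-in sqlite3 module, where COMMIT and ROLLBACK are issued through conn.commit() and conn.rollback(). This is a minimal illustration, not from the text; the account table, balances, and transfer amount are all assumed for the example:

```python
# Minimal sketch of a transaction as an atomic unit of work,
# using Python's built-in sqlite3 module. Table and values are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 100)")
conn.commit()

# Transfer 30 from account 1 to account 2 as one logical unit of work:
# either both UPDATEs take effect (COMMIT) or neither does (ROLLBACK).
try:
    conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE account SET balance = balance + 30 WHERE id = 2")
    conn.commit()
except sqlite3.Error:
    conn.rollback()

balances = dict(conn.execute("SELECT id, balance FROM account"))
print(balances)  # {1: 70, 2: 130}
```

If either UPDATE raised an error, the rollback would discard the partial work, preserving atomicity.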
The three types of data read problems that can occur are dirty read, nonrepeatable read, and phantom read. These problems are summarized in Figure 6-11. The 1992 SQL standard defines four transaction isolation levels: read uncommitted, read committed, repeatable read, and serializable. The characteristics of each are summarized in Figure 6-12. A cursor is a pointer into a set of records. Four cursor types are prevalent: forward only, static, keyset, and dynamic. Developers should select isolation levels and cursor types that are appropriate for their application workload and for the DBMS product in use. The goal of database security is to ensure that only authorized users can perform authorized activities at authorized times. To develop effective database security, the processing rights and responsibilities of all users must be determined. DBMS products provide security facilities. Most involve the declaration of users, groups, objects to be protected, and permissions or privileges on those objects. Almost all DBMS products use some form of user name and password security. DBMS security can be augmented by application security. In the event of system failure, the database must be restored to a usable state as soon as possible. Transactions in process at the time of the failure must be reapplied or restarted. Although in some cases recovery can be done by reprocessing, the use of logs with before-images and after-images with rollback and rollforward is almost always preferred. Checkpoints can be made to reduce the amount of work that needs to be done after a failure. In addition to concurrency control, security, and backup and recovery, a DBA needs to ensure that a system exists to gather and record errors and problems. The DBA works with the development team to resolve such problems on a prioritized basis and also to evaluate features and functions of new releases of the DBMS.
In addition, the DBA needs to create and manage a process for controlling the database configuration so that changes to the database structure are made with a community-wide view. Finally, the DBA is responsible for ensuring that appropriate documentation is maintained about database structure, concurrency control, security, backup and recovery, and other details that concern the management and use of the database.
CHAPTER 8
1. What are BI systems? Business intelligence (BI) systems are information systems that assist managers and other professionals in the analysis of current and past activities and in the prediction of future events.
4. What are the three sources of data for BI systems? Operational databases, other internal data, and external data.
6. What is an ETL system, and what functions does it perform? An Extract, Transform, and Load system; it extracts data from operational systems, cleans and prepares (transforms) the data for BI processing, loads the data into the data warehouse, and maintains metadata describing the data's source, format, assumptions, and constraints.
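A minimal sketch of the extract, transform, and load steps, assuming an in-memory list as the operational source and SQLite as the warehouse; all names and data are illustrative:

```python
# Minimal ETL sketch: extract raw rows, clean them, load them into
# an in-memory SQLite "warehouse". Names and data are illustrative.
import sqlite3

# Extract: pull raw rows from the operational source.
operational_rows = [
    (" Alice ", "2023-01-05", "120.50"),
    ("BOB", "2023-01-06", "80.00"),
    ("bob", None, "15.25"),          # dirty row: missing date
]

# Transform: normalize names, drop rows that fail validation, cast amounts.
clean_rows = [
    (name.strip().title(), date, float(amount))
    for name, date, amount in operational_rows
    if date is not None
]

# Load: store the cleaned rows in the warehouse table.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales_fact (customer TEXT, sale_date TEXT, amount REAL)")
dw.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", clean_rows)
dw.commit()

loaded = dw.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(loaded)  # 2
```

A production ETL system would also record metadata about each step; here the point is only the extract-transform-load flow itself.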
9. Why are data warehouses necessary? Because direct reading of operational databases is generally not feasible: BI queries can unacceptably slow operational systems, and operational data have problems that limit their usefulness for BI applications. A data warehouse holds cleaned, prepared data and provides extracts of that data to BI tools, such as data mining programs.
11. Explain the difference between a data warehouse and a data mart. Give an example other than the ones used in this book. The data warehouse takes data from operational systems and other sources, cleans and processes it, and stores it in the data warehouse DBMS; a data mart is a collection of data that is smaller than that in the data warehouse and that addresses a particular component or functional area of the business. For example, a retail chain's data warehouse might feed a marketing data mart that holds only customer and promotion data.
13. Describe the differences between operational databases and dimensional databases. An operational database supports the day-to-day operations of the enterprise, typically as an OLTP system. A dimensional database is a database design used for data warehouses and is designed for efficient queries and analysis.
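The dimensional design can be illustrated with a minimal star schema in SQLite: a fact table holding the measure, joined to dimension tables (including a time dimension). All table names and data are assumed for the example:

```python
# Minimal star-schema sketch: one fact table whose foreign keys
# reference dimension tables. Names and data are illustrative.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customer_dim (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product_dim  (product_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE time_dim     (time_id     INTEGER PRIMARY KEY, sale_date TEXT);
CREATE TABLE sales_fact (
    customer_id INTEGER REFERENCES customer_dim(customer_id),
    product_id  INTEGER REFERENCES product_dim(product_id),
    time_id     INTEGER REFERENCES time_dim(time_id),
    amount      REAL     -- the measure of interest
);
""")
db.execute("INSERT INTO customer_dim VALUES (1, 'Alice')")
db.execute("INSERT INTO product_dim VALUES (1, 'Widget')")
db.execute("INSERT INTO time_dim VALUES (1, '2023-01-05')")
db.execute("INSERT INTO sales_fact VALUES (1, 1, 1, 120.50)")

# A typical dimensional query: join the fact table to its dimensions.
row = db.execute("""
    SELECT c.name, t.sale_date, f.amount
    FROM sales_fact f
    JOIN customer_dim c ON f.customer_id = c.customer_id
    JOIN time_dim t ON f.time_id = t.time_id
""").fetchone()
print(row)  # ('Alice', '2023-01-05', 120.5)
```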
18. What is a slowly changing dimension? A dimension whose data values change slowly over time, such as a customer's address; because the data warehouse must track when such values change, a time dimension is important.
21. What does OLAP stand for? Online Analytical Processing.
22. What is the distinguishing characteristic of OLAP reports? They are dynamic: the format of an OLAP report can be changed by the viewer, hence the term online in the name Online Analytical Processing.
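That dynamic quality can be sketched in a few lines of Python: the same measure is totaled along whichever dimension the viewer chooses. The data and dimension names are illustrative:

```python
# Minimal sketch of restructuring an OLAP-style report: one measure
# (amount) grouped first by one dimension, then by another.
from collections import defaultdict

# (region, product, amount): two dimensions and one measure
facts = [
    ("East", "Widget", 10.0),
    ("East", "Gadget", 20.0),
    ("West", "Widget", 5.0),
]

def cube(facts, dim_index):
    """Total the measure along one chosen dimension."""
    totals = defaultdict(float)
    for row in facts:
        totals[row[dim_index]] += row[2]
    return dict(totals)

by_region = cube(facts, 0)   # {'East': 30.0, 'West': 5.0}
by_product = cube(facts, 1)  # {'Widget': 15.0, 'Gadget': 20.0}
print(by_region, by_product)
```

An OLAP tool does the same regrouping interactively, which is what lets the viewer change the report's format on the fly.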
25. What is drill down? Further dividing the data into more detail, for example breaking a yearly total into its quarterly or monthly components.
8.27 Define distributed database.
A distributed database is a database that:
(1) Has been split into sections called partitions, with the partitions stored on different computers, or
(2) Has copies of the database, called replicas, stored on different computers, or
(3) Has been both partitioned and replicated.
8.28 Explain one way to partition a database that has three tables: T1, T2, and T3.
Assume we have three database servers: S1, S2, and S3. Partition the database by putting one table on each server:
• Server S1 holds table T1
• Server S2 holds table T2
• Server S3 holds table T3
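The placement above can be sketched by modeling each "server" as a separate in-memory SQLite database and routing each table to exactly one of them; this is a simplification for illustration only:

```python
# Minimal partitioning sketch: each table lives on exactly one "server",
# and queries are routed to the server that holds the partition.
import sqlite3

servers = {name: sqlite3.connect(":memory:") for name in ("S1", "S2", "S3")}
placement = {"T1": "S1", "T2": "S2", "T3": "S3"}

for table, server in placement.items():
    servers[server].execute(f"CREATE TABLE {table} (id INTEGER)")

def server_for(table):
    """Route a request to the server that holds this table's partition."""
    return servers[placement[table]]

server_for("T2").execute("INSERT INTO T2 VALUES (42)")
count = server_for("T2").execute("SELECT COUNT(*) FROM T2").fetchone()[0]
print(count)  # 1
```

Note that a transaction touching both T1 and T2 would have to span two servers, which is exactly the processing challenge discussed later for partitioned databases.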
8.29 Explain one way to replicate a database that has three tables: T1, T2, and T3.
Assume we have three database servers: S1, S2, and S3. Replicate the database by putting all three tables on each server:
• Server S1 holds tables T1, T2, and T3
• Server S2 holds tables T1, T2, and T3
• Server S3 holds tables T1, T2, and T3
8.30 Explain what must be done when fully replicating a database but allowing only one computer to process updates.
The computer that processes updates must propagate each change to all the other computers, so that every replica of the database remains logically consistent.
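A minimal sketch of this single-master arrangement, with illustrative class names: the one updating computer applies each change locally and then propagates it to the read-only replicas:

```python
# Minimal single-master replication sketch: only the master accepts
# updates, and each committed change is pushed to every replica.
class Replica:
    def __init__(self):
        self.data = {}

class Master(Replica):
    def __init__(self, replicas):
        super().__init__()
        self.replicas = replicas

    def update(self, key, value):
        self.data[key] = value
        # Propagate the change so the replicas stay logically consistent.
        for replica in self.replicas:
            replica.data[key] = value

replicas = [Replica(), Replica()]
master = Master(replicas)
master.update("T1:row1", "hello")
print(all(r.data == {"T1:row1": "hello"} for r in replicas))  # True
```

Because only the master writes, no conflicting updates can arise; the cost is that every change must travel to every replica.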
8.31 If more than one computer can update a replicated database, what three problems can occur?
If more than one computer can update a replicated database, then:
• There may be inconsistent updates,
• One computer may delete a record that another computer is updating, and
• Changes may be made that violate uniqueness constraints.
8.33 Explain what problems can occur in a distributed database that is partitioned but not replicated.
If a transaction needs data from more than one partition, it must be processed across multiple computers, which complicates transaction management. In addition, because the data are only separated into pieces, not copied, the failure of any one computer makes its portion of the data unavailable.
8.40 What is Big Data?
Extremely large data sets, such as the analytics data generated by search engines.
8.42 What is the NoSQL movement?
NoSQL (now often read as “Not only SQL”) is a movement toward nonrelational databases, driven by the Big Data storage needs of companies such as Amazon.com, Google, and Facebook.
8.43 What was the first nonrelational data store to be developed, and who developed it?
An early nonrelational data store was Bigtable, developed by Google. (The term “NoSQL” itself was in fact first used by Carlo Strozzi in 1998.)
8.48 What is virtualization?
Virtualization is the use of hardware and software to simulate another hardware resource, which allows several logical servers to be consolidated onto one larger physical machine.
8.50 What is cloud computing? What major technology enables cloud computing?
Cloud computing uses remote servers to host data, software, or both. It is enabled by the Internet, which provides availability and scalability, and by virtualization, which allows servers, storage, and network resources to be provisioned dynamically.
8.51 What are the differences between SaaS, PaaS, and IaaS?
These are the three basic ways to lease cloud services:
• Software as a Service (SaaS): complete applications are provided; example: Salesforce.com
• Platform as a Service (PaaS): an operating system, software development tools, and system programs are provided
• Infrastructure as a Service (IaaS): only hardware is provided, and users manage their own software
This chapter introduced a number of concepts that go beyond the “standard” setting of an operational relational database housed on a single computer. Databases can support more than standard SQL querying, in the form of BI reporting and data mining. At the same time, the sheer volume and complexity of modern data management have led to a resurrection and expansion of ideas from distributed computing and object databases, now used in DBMS processing in the form of cloud/virtualization and NoSQL databases. Business intelligence (BI) systems assist managers and other professionals in the analysis of current and past activities and in the prediction of future events. BI applications are of two major types: reporting applications and data mining applications. Reporting applications make elementary calculations on data; data mining applications use sophisticated mathematical and statistical techniques. BI applications obtain data from three sources: operational databases, extracts of operational databases, and purchased data. A BI system sometimes has its own DBMS, which may or may not be the operational DBMS. Characteristics of reporting and data mining applications are listed in Figure 8-4. Direct reading of operational databases is not feasible for any but the smallest and simplest BI applications and databases for several reasons. Querying operational data can unacceptably slow the performance of operational systems; operational data have problems that limit their usefulness for BI applications; and BI system creation and maintenance require programs, facilities, and expertise that are normally not available for an operational database. Because of the problems with operational data, many organizations have chosen to create and staff data warehouses and data marts.
Extract, transform, and load (ETL) systems are used to extract data from operational systems, transform the data and load them into data warehouses; and maintain metadata that describe the source, format, assumptions, and constraints of the data. A data mart is a collection of data that is smaller than that held in a data warehouse and addresses a particular component or functional area of the business. Operational databases and dimensional (data warehouse) databases have different characteristics, as shown in Figure 8-7. Dimensional databases use a star schema and must deal with slowly changing dimensions, so a time dimension is important. Fact tables hold measures of interest, and dimension tables hold attribute values used in queries. The star schema can be extended with additional fact tables, dimension tables, and conformed dimensions. The purpose of a reporting system is to create meaningful information from disparate data sources and to deliver that information to the proper users on a timely basis. Reports are produced by sorting, filtering, grouping, and making simple calculations on the data. RFM analysis is a typical reporting application. An RFM report can be produced using SQL statements. Online analytical processing (OLAP) reporting applications enable users to dynamically restructure reports utilizing measures and dimensions. A measure is a data item of interest. A dimension is a characteristic of a measure. An OLAP report, or OLAP cube, is an arrangement of measures and dimensions. With OLAP, users can drill down and exchange the order of dimensions. A distributed database is a database that is stored and processed on more than one computer. Variations of distributed databases include replicated and partitioned databases. A replicated database is one in which multiple copies of some or all of the database are stored on different computers. A partitioned database is one in which different pieces of the database are stored on different computers. 
A distributed database can include both replication and partitioning. Distributed databases pose processing challenges. If a database is updated on a single computer, then the challenge is simply to ensure that the copies of the database are logically consistent when they are distributed. However, if updates are to be made on more than one computer, the challenges become significant. If the database is partitioned and not replicated, then challenges occur if transactions span data on more than one computer. If the database is replicated and if updates occur to the replicated portions, then a special locking algorithm called distributed two-phase locking is required. Implementing this algorithm can be difficult and expensive. Objects consist of methods and properties or data values. All objects of a given class have the same methods, but they can have different property values. Object persistence is the process of storing object property values on disk. Relational databases are difficult to use for object persistence. Some specialized products called object-oriented DBMSs were developed in the 1990s but never received large-scale commercial acceptance. Oracle and others, following the SQL standard, have extended the capabilities of their relational DBMS products to provide support for object persistence. Such databases are referred to as object-relational databases. The physical setting for much NoSQL and Big Data work is often in the cloud and/or on virtual machines, which allows us to consolidate several logical servers into one larger physical one and to provide tremendous flexibility in dynamically provisioning servers, storage, and network resources. A special program called a hypervisor provides the virtual environment and manages the virtual machines. Cloud computing allows remote computers to host data, software, or both, taking advantage of the Internet to provide availability and scalability. 
Thus, portions of data centers can be leased by customers who are charged only for the resources they use. The NoSQL movement (now often read as “Not only SQL”) is built upon the need to meet the Big Data storage needs of companies such as Amazon.com, Google, and Facebook. These systems typically make use of cloud technology (derived in part from earlier work on distributed databases) and complex structuring techniques (derived in part from earlier work on object databases). The tools used to do this are nonrelational DBMSs, sometimes referred to as structured storage or NoSQL DBMSs. An early example was Bigtable, and a more recent popular example is Cassandra, a column family DBMS. Column family products use a non-normalized table structure built on columns, super columns, column families, and super column families tied together by RowKey values from a keyspace. Other varieties of NoSQL DBMS include key-value, document, and graph DBMSs, all of which are described in more detail in Appendix K. Data processing of the very large datasets found in Big Data is often done by the MapReduce process, which breaks a data processing task into many parallel tasks performed by many computers in the cluster and then combines these partial results to produce a final result. An emerging product that is supported by Microsoft and Oracle Corporation is the Hadoop Distributed File System (HDFS), with its spinoffs HBase, a nonrelational storage component, and Pig, a query language.
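The MapReduce process described above can be sketched in a single process; a real system would run the map tasks in parallel across a cluster, and the word-count task and data here are illustrative:

```python
# Minimal single-process sketch of the MapReduce pattern: map each
# input chunk to a partial result, then reduce the partial results
# into a final answer.
from collections import Counter
from functools import reduce

chunks = ["big data big", "data cluster", "big cluster cluster"]

def map_task(chunk):
    """Count words in one chunk (a partial result)."""
    return Counter(chunk.split())

def reduce_task(a, b):
    """Combine two partial results."""
    return a + b

partials = [map_task(c) for c in chunks]          # parallel in a real cluster
totals = reduce(reduce_task, partials, Counter())
print(dict(totals))  # {'big': 3, 'data': 2, 'cluster': 3}
```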