· 6 years ago · Mar 15, 2019, 05:54 PM
1---
2title: 'Lab 6: Two Sample EDA & QQ Plots'
3date: "STAT 250 - Spring 2019"
4output:
5 html_document: default
6---
7
8
9### Name: Vicki Meraz
10
11##### How each person contributed to the lab: I worked on it independently
12
13## Skills
14
15* Identify the explanatory and response variables in an experiment or observational study
16* Identify the type of study and its parts
17* Identify areas of randomization in a study
18* Be able to produce appropriate graphics and summary statistics for different types of data
19* Be able to interpret data from a boxplot and histogram
20* Understand how a QQ plot is created and why it can assess normality
21* Be able to assess normality of a sample based on the QQ plot
22
23
24
25
26**Please make sure to show all R code and output after each question so that we can see your work.** Write a sentence for each numerical value produced describing its meaning **in context with the proper units**. Be sure to submit **both** the knit Word document and the RMarkdown (.Rmd) file inside your zipped project to get full credit on your lab. (Remember, submitting a document that doesn't knit is preferable to not submitting anything!)
27
28
29You may work in pairs from within the same section, but include both partner's names on the documents and write a short description of how each person contributed to the lab. Each person is responsible for submitting the documents to the assignment (i.e. each person in the pair submits the same documents to iLearn).
30
31Here are some symbols that you might find useful:
32$H_0$ $H_1$ $\sigma$ $\mu$ $\mu_0$ $\mu_{word}$ $\neq$
33
34***
35
36```{r setup, include=FALSE}
37knitr::opts_chunk$set(echo = TRUE, comment = "",warning = FALSE, message = FALSE)
38require(rmarkdown, quietly=TRUE)
39```
40
41
42
43
44## Part 1: Exploratory Data Analysis
45
46### Scenario 1.1: Gender Bias
47
48Gender bias stems from a perceived mismatch from an expected role or characteristics based on gender. Studies have shown that men and woman have unconscious gender biases against woman in traditionally male-dominated fields (such as the sciences) or characteristics (such as leadership qualities). These biases often cause equally qualified women to be seen as [less likable](https://www.youtube.com/watch?v=hK0Q8b6QhDo) or [less qualified](https://blogs.scientificamerican.com/unofficial-prognosis/study-shows-gender-bias-in-science-is-real-heres-why-it-matters/) than the men. (These links are to descriptions of two well-known studies, but there are plenty of other good resources).
49
50Researchers are interested if this gender bias exists in traditionally female-dominated jobs as well, such as teaching. Students are asked to watch a video of an animated classroom and rate the professor. Each student is randomly assigned to either of two animations; the videos are exactly the same except for the gender of the professor drawn. You have been asked to analyze the data for the researchers to determine if the female professor is rated more poorly, on a 1 to 7 scale (with 7 being the best), than the male professor.
51
521.1.1. Identify each variable of interest in our analysis and the type of variable for each by replacing the [placeholders] below.
53
54Variable | Explanatory or Response Variable? | Numeric or Categorical?
55----------------|------------------------------------|-------------------------
56Gender | Explanatory | categorical
57Professor rating | response | numeric
58
59
60
611.1.2. State the parameter(s) of interest in the study.
62
63> The true mean of rating of a female professor compared to the true mean rating of a male professor.
64
65
661.1.3. State the type of data collected in this study (Independent or Matched-pair?). Justify your response.
67
68> Independent type of data collected
69> Because each student is randomly assigned a different video it is the same summary statistics for the response variables it is just split by whether they watched the male animation or the female animation. The students are not paired with anything else other then the video they are watching and are not dependent on eachother.
70
71
721.1.4. We want to calculate summary statistics on the ratings for male professors and for female professors separately. To do so, we will need to subset the raw data file into two groups. The data has been read in and the first vector extracted for you. Modify the third line of code to extract out the female professor ratings.
73
74```{r 1.1 extract two vectors}
75bias<-read.csv("prof_ratings.csv", header=TRUE) # reads in the csv file
76male<-bias$Rating[bias$Gender=="Male"] # subsets raw male data and saves it as male
77fem<- bias$Rating[bias$Gender=="Female"] # subsets raw female data and saves it as fem
78```
79
80
811.1.5. Modify the code below to calculate the summary statistics for each population. Place your resulting values in the table below using in-line code, rounding to four decimal places as needed.
82
83```{r}
84(n.m<-length(male)) # displays sample size for students who watched male animation and names it as n.m
85(xbar.m<-mean(male)) # displays sample mean of rating for students who watched male animation and names it as xbar.m
86(var.m<-var(male)) # displays sample variance of rating for students who watched male animation and names it as var.m
87(s.m<-sd(male)) # displays sample standard deviation of rating for students who watched male animation and names it as s.m
88(n.f<-length(fem)) # displays sample size for students who watched female animation and names it as n.f
89(xbar.f<-mean(fem)) # displays sample mean of rating for students who watched female animation and names it as xbar.f
90(var.f<-var(fem)) # displays variance of rating for students who watched female animation and names it as var.f
91(s.f<-sd(fem)) # displays sample standard deviation of rating for students who watched female animation and names it as s.f
92```
93
94**Population** | $n$ | $\bar{x}$ | $s^2$ | $s$
95---------------|---------------|-------------------------------|----------------
96**Male** | `r n.m` | `r xbar.m` | `r var.m` | `r s.m`
97**Female** | `r n.f` | `r xbar.f` | `r var.f` | `r s.f`
98
99
100
1011.1.6. Create a boxplot of the data by modifying the code below. Notice the new symbol, `~`, which translates as "by" or "as a function of." It lets us graph one variable split by another (for a boxplot, a numeric variable split by a categorical variable).
102
103```{r}
104boxplot(bias$Rating~bias$Gender,
105 horizontal=TRUE,
106 xlab="rating of professors",
107 ylab="Gender of professors",
108 main="Boxplot for Gender bias on female Professors")
109```
110
111
1121.1.7. Based on your boxplot, what is the approximate median for female rating and for male rating?
113
114> The median for female professors is about 4 and the median for male professors is about 5.
115
116
1171.1.8. Based on your boxplot, what can you conclude about the amount of variability in rating of male versus female professors? (Justify your answer)
118
119> Based on my box plot the IQR is approximately equal and therefoe the variability is about the same.
120
1211.1.9. Based on your *boxplot*, assess whether each distribution is normal. Justify your responses.
122
123> The distribution based on my box plots appear normal because there is not an aparent skew and the median and the mean are about the same for each box plot.
124
125
126
127### Scenario 1.2: Fat Bears
128
129Katmai National Park is famous for the salmon run each summer and the brown bear feast that accompanies it. Brown bears can lose roughly 30% of their body weight while hibernating, so it is essential that they pack on the pounds during the summer. They feed on high-caloric foods, like sockeye salmon (one salmon can be ~4000 calories), but also on sedges (a grass, often eating about 30 lbs of sedges a day), berries, grubs, etc. Over the course of the summer, bears have to add hundreds of pounds to their body weight to make up for weight lost during the past hibernation, and to survive the next hibernation; at peak season, brown bears can add up to four pounds of weight per day! Over the summer, the brown bears can really pack it on! (case in point: see before and after pictures of the [Katmai Fat Bear finalists](https://www.npr.org/2018/10/10/655937903/alaska-national-park-declares-fattest-bear-winner))
130
131You are a researcher interested in the weight gain of brown bears over the summer. You weigh a random selection of brown bears at the beginning of June, and then weigh the same bears at the end of September and record their weight (in pounds).
132
133This code reads in our data and then prints the vector names.
134```{r}
135bears<-read.csv("BearWeight.csv", header=TRUE) # reads in csv file
136names(bears) # displays variable names
137```
138
139
1401.2.1. Identify each variable of interest in our analysis and the type of variable for each by replacing the [placeholders] below.
141
142Variable | Explanatory or Response Variable? | Numeric or Categorical?
143----------------|------------------------------------|-------------------------
144season | explanatory | categorical
145weight | response | numeric
146
147
148
1491.2.2. State the parameter(s) of interest in the study.
150
151> The true mean difference in weight of the same bear after summer and before summer.
152
153
1541.2.3. State the type of data collected in this study (Independent or Matched-pair?). Justify your response.
155
156> Matched-pair
157> It is matched pair because the bear before summer is compared to itself after summer, so the bears chosen after summer are dependent on the bears that were chosen before summer and are matched for comparison
158
159
1601.2.4. We want to calculate summary statistics on the weight *gained* over the summer. Write a line of code below that will create a new vector for weight gained, using the square brackets to subset the data.
161
162```{r}
163gain<-bears$Weight[bears$Season=="September"]-bears$Weight[bears$Season=="June"] # gives the difference for after summer weight and before summer weight of bears and subsets this difference and names it as gain
164```
165
166
1671.2.5. Modify the code below to calculate the summary statistics (listed below) for *differences* in the weight gained over the summer. Place your resulting values in the table below, rounding to four decimal places as needed.
168
169```{r}
170(n.gain<-length(gain)) # displays sample size for the difference in bears weight after summer and before summer names it as n.gain
171(xbar.gain<-mean(gain)) # displays sample mean for the difference in bears weight after summer and before summer names it as xbar.gain
172(var.gain<-var(gain)) # displays sample variance for the difference in bears weight after summer and before summer names it as var.gain
173(s.gain<-sd(gain)) # displays sample standard deviation for the difference in bears weight after summer and before summer names it as s.gain
174```
175
176
177Symbols | Weight~diff~
178------------|---------------
179$n$ | `r n.gain`
180$\bar{x}_D$ | `r xbar.gain`
181$s_D^2$ | `r var.gain`
182$s_D$ | `r s.gain`
183
184
185
1861.2.6. Create a boxplot for the weight gained over the summer. Be sure to label it appropriately.
187
188
189```{r}
190boxplot(gain, horizontal = TRUE, main="Bears weight gained over summer", xlab="weight in pounds")
191```
192
193
194
1951.2.7. Describe the distribution of weight gained over summer in brown bears based on your boxplot.
196
197> Based on the boxplox the distribution of brown bears weight gained over summer looks approximately normal with perhaps a very slight right skew but
198
199
200Want more Fat Bear before and after pictures? Check out the photo in your zip file, or Katmai's [feed](https://twitter.com/KatmaiNPS) from October.
201
202
203
204
205### Scenario 1.3: Burrowing Owls & Dung Beetles
206
207Burrowing owls (*Athene cunicularia*) place mammalian dung in and around their burrows. If removed, it may be rapidly replaced, suggesting that it is more than an incidental accumulation of debris. Dung beetles are one of the most common types of prey items for burrowing owls (the owls are diurnal, as are the main species of dung beetle near these populations), which spend long periods of time standing motionless around their burrows, possibly waiting to snag tasty beetles wandering by. A research team wanted to know whether this dung actually attracted dung beetles or whether it had another use, such as to mask the odor of owl eggs from predators (Levey et al. 2004). To test the baiting hypothesis, the researchers randomly selected fifty burrows, and removed all dung, regurgitated pellets and beetle parts from the burrow entrances. They then randomly selected half of those to received a equal amount of cow dung, typical of the amount found at a burrow entrance; the remainder received no dung but were otherwise treated in the same way. After four days, they collected all prey remains and regurgitated pellets from the burrows and determined the number of beetles consumed by owls at each burrow.
208
209
210
2111.3.1. Insert a code chunk that will read in your csv file `BurrowingOwls.csv`, saves it as the object `owls`, and then print a list of the vector names to the screen. (*Note: You can __NOT__ use `View`*.)
212
213< Insert code chunk here >
214```{r}
215owls<-read.csv("BurrowingOwls.csv", header=TRUE) #reads in csv file for owl data
216names(owls) # dispalys the variable names
217```
218
219
220
2211.3.2. Identify each variable of interest in our analysis and the type of variable for each by replacing the [placeholders] below.
222
223Variable | Explanatory or Response Variable? | Numeric or Categorical?
224----------------|------------------------------------|-------------------------
225Dung | explanatory | categorical
226beetles | response | numeric
227
228
229
2301.3.3. State the parameter(s) of interest in the study.
231
232> The true mean amount of beetles eaten by owls with dung present
233> The true mean of beetles eaten by owls with out dung present
234
235
236
2371.3.4. State the type of data collected in this study (Independent or Matched-pair?). Justify your response.
238
239> Independent
240> Because the borrows that were treated with dung and with out dung were not dependent on eachother and we independently measured.
241
242
243
2441.3.5. We want to calculate summary statistics. Subset the data accordingly, and calculate the relevant summary statistics. Annotate your code using `#` to describe what each line of code is doing. Enter your values in the table below using in-line code. (*Think: what do you need to add to your code to use it in in-line code?*)
245
246```{r}
247w.d<-owls$Beetles[owls$Dung=="Present"] # subsets beetle data with dung and saves it as w.d
248wo.d<- owls$Beetles[owls$Dung=="Absent"] # subsets beetle data with out dung and saves it as wo.d
249(n.w.d<-length(w.d)) # displays sample size beetle data with dung and saves it as n.w.d
250(xbar.w.d<-mean(w.d)) # displays sample mean of beetle data with dung and saves it as xbar.w.d
251(var.w.d<-var(w.d)) # displays sample variance beetle data with dung and saves it as var.w.d
252(s.w.d<-sd(w.d)) # displays sample standard deviation of beetle data with dung and saves it as s.w.d
253(n.wo.d<-length(wo.d)) # displays sample size for beetle data without dung and saves it as n.wo.d
254(xbar.wo.d<-mean(wo.d)) # displays sample mean beetle data without dung and saves it as xbar.wo.d
255(var.wo.d<-var(wo.d)) # displays variance beetle data without dung and saves it as var.wo.d
256(s.wo.d<-sd(wo.d)) # displays sample standard deviation beetle data without dung and saves it as s.wo.d
257```
258
259
260
261Symbols | Dung Present | Dung Absent
262----------|---------------|---------------
263$n$ | `r n.w.d` | `r n.wo.d`
264$\bar{x}$ | `r xbar.w.d` | `r xbar.wo.d`
265$s^2$ | `r var.w.d` | `r var.wo.d`
266$s$ | `r s.w.d` | `r s.wo.d`
267
268
269
2701.3.6. Create two *comparable* histograms of your sample data from each population. Be sure to label it appropriately. You may find adding the following arguments helpful to making the histograms comparable: xlim, ylim, breaks. (*Don't recall how to use xlim or ylim? Try the hist() help page or your old labs!*) The argument `breaks` can be used in many ways, the easiest of which is to supply it one number for the number of bins you wish to have in the histogram.
271
272
273```{r}
274hist(w.d, xlim=c(0,8), ylim=c(0,10), main = "Amount of beetles eaten by owls with dung present at the burrow", xlab="number of beetles", breaks=5) #gives histogram with titles
275hist(wo.d, xlim=c(0,8), ylim=c(0,10), main = "Amount of beetles eaten by owls without dung present at the burrow", xlab="number of beetles", breaks=5) #gives histogram with titles
276```
277
278
279
2801.3.7. Create a boxplot of your sample data from each population in one single graphic. Be sure to label it appropriately.
281
282
283```{r}
284boxplot(owls$Beetles~owls$Dung,
285 horizontal=TRUE,
286 xlab="Amount of beetles",
287 ylab="Dung",
288 main="Amount of beetles eaten with dung varing variable")
289```
290
291
292
2931.3.8. Without conducting the hypothesis test, what are your conclusions about the differences of beetles consumed by burrowing owls based on the presence or absence of dung at the burrow entrance, based solely on the provided exploratory data analysis? Justify using the graphics and summary statistics.
294
295> Based on the sample taken on the amount of beetles eaten by owls with and with out dung on the outside of their burrow I can conclude that if there is dung present then there is more likelyhood that the owls are eating more beetles. The histograms and boxplots above support this. The boxplot clearly shows that 75% of the data that with dung present is greater then 75% of the data with out dung present and therefore dung increases amount of beetles being eatin and helps attract beetles to the owls burrows.
296
297
298
299***
300
301### Part 2: Testing for Normality from data
302
303
304If the assumptions for the t-test are not satisfied then we cannot trust the inferences we make from the hypothesis test or confidence interval. One of those assumptions is the normality of the populations from which we selected the samples. In order to trust the results of a t-test for means, we must be sure that the data is approximately normal, even for larger sample sizes, but especially for small sample sizes. We can test this using the QQ plot (Quantile-Quantile plot).
305
306Here is some observed data. We want to determine if it is a reasonable assumption to believe the data is from a normally distributed population variable.
307```{r}
308observed<-c(10.2,10.7,11.2,11.3,11.7,11.8,12,12.1,12.3,12.8,12.9,13.1)
309```
310First, we can use our usual visualizations to check the distribution of the data. Here is a histogram and overlay of the Normal curve to the data.
311```{r}
312hist(observed,main="Observed Values",xlab="Data Values",ylab="Probability",freq=FALSE,col="green")
313m<-mean(observed) #calculates mean of data
314std<-sqrt(var(observed)) #calculate sd of data
315curve(dnorm(x, mean=m, sd=std),col="darkblue", lwd=2,add=TRUE,yaxt="n")
316```
317
318While a histogram is helpful, it does not test normality. It could be the data are from some other symmetric or bell-shaped distribution. We need a way to *actually, objectively* check how reasonable it is for the data to be from a normal distribution. For this we can use a Quantile-Quantile Plot, or QQ Plot.
319
320The name Quantile-Quantile Plot comes from the fact that we are going to compare quantiles, also called percentiles. The first set of quantiles will be the actual quantiles of the data.
321
322First, we will rank the data, from smallest to largest.
323```{r}
324ranks<-rank(observed) #the ranking from smallest to largest of each data point
325ranks
326```
327
328Next, we will calculate the percentile, or the percent of data values that are less than the observed value. This similar to how we found the quantiles, Q1 (25th percentile), Median (50th percentile), and Q3 (75th percentile) in exploratory data analysis. For a rank of 3, for example, then three data points are less than or equal to that observed value and so it has percentile value of `r 3/12`. We often make a slight modification to the percentile calculation, though, to allow for the fact that the we would often expect the actual "first value" to have a value less than the minimum, or between rank 0 and rank 1. We also expect that our maximum should not be 100th percentile. To account for this, we subtract the quantity 0.5 so that we are exactly in the middle of the interval of ranks when we calculate the percentile (so instead of calculating the percentile for rank 1, we calculate the percentile for (1-0.5)/12).
329```{r}
330percentile<-(ranks-0.5)/length(ranks) #based on the rank, you calculate the proportion of values less than that value, this is the observed percentile
331percentile
332```
333
334Next, we will use the observed percentiles to calculate the z-score that would correspond to that percentile if the data was normally distributed.
335```{r}
336z.scores<-qnorm(percentile) #calculate the z-score that would correspond with the percentiles if the data was normal
337z.scores
338```
339
340Remember, the z-score is just a linear transformation of our data, or
341$$z=\frac{x-\mu}\sigma$$
342or with a little algebra
343$$z=\frac{1}\sigma x - \frac{\mu}\sigma$$
344So, if the data is normally distributed, if we plot the original data, $x$, and the z-scores found based on the percentiles, $z$, we should observe a linear relationship between them *if* the data **is** normally distributed.
345```{r}
346plot(z.scores,observed,xlab="The expected z-scores if the data is normally distributed",ylab="My sample observed values",main="My QQ plot")
347```
348
349R has a built-in function `qqnorm()`. The more linear relationship, the more likely it is that the Normal Distribution is a reasonable model for the data.
350```{r}
351qqnorm(observed) #builds the qq plot
352qqline(observed) #adds the line expected if the data is normal
353```
354
3552.1. Based on the observed values, is our variable normally distributed?
356
357> Yes the variable is normally distributed because the points fall reletivly along the line and there is slight deviation from the line at the tails yet that is expected because of the small sample size and it is not a large deviation
358
359
360
3612.2. Evaluate the normality of the Florida Lakes `pH` data to determine if the data is normally distributed based on the QQ Plot.
362
363```{r}
364require(Lock5Data)
365c<-(FloridaLakes)
366qqnorm(FloridaLakes$pH, xlab="pH for quantiles", ylab = "lake samples", main="Flordia Lake pH QQ plot") #builds the qq plot for flordia lake pH
367qqline(FloridaLakes$pH) #adds the line expected if the data is normal
368```
369
370> I believe that the Flordia Lake pH data is normaly distributed because the data does not deviate away from the 1:1 expected line very much and majority of the points fall on the line for the sample size
371
372
373
374
375***
376
377### Part 3: Assessing Normality
378
379#### Scenario 1: Gender Bias
380
381The following code produces the QQ plot for our male professor ratings.
382```{r}
383qqnorm(male, main="QQ Plot for Males") # produces a qq plot for the male data; labels the title so we know which is which later
384qqline(male) # adds the 1:1 line to the plot
385```
386
3873.1.1. Modify the code below to make a QQ plot for the female data.
388
389```{r}
390qqnorm(fem, main="QQ Plot for Females") # Produces a qq plot for females and lables it QQ plot for females
391qqline(fem) # adds the 1:1 line for the female QQ plot
392```
393
3943.1.2. Assess each plot for normality, including a justification.
395
396> Both data sets appear to be normal because they follow the 1:1 line for most of the data there is slight skew yet that could be expected for the sample size and even though the data is discrete and looks a bit deviated I believe that the data is normal because it follows the same line and does not curve or behave unexpectedly
397
398
3993.1.3. Why do the points fall in this step-wise pattern?
400
401> As I mentioned before the data is discrete so there are intervals of one that the data has to fall under and it can not be a decimal it must be a whole number within the interval 1-7.
402
403
404
405
406
407#### Scenario 3: Burrowing Owls and Dung Beetles
408
4093.3.1. Produce a qqplot for each of your burrowing owl beetle consumption by dung presence.
410
411
412```{r}
413qqnorm(w.d, main="QQ Plot for Owl burrows with dung") # produces a qq plot for the data of owl burrows with dung
414qqline(w.d) # adds the 1:1 line to the plot
415qqnorm(wo.d, main="QQ Plot for Owl burrows with out dung") # produces a qq plot for the owl burrows with out dung
416qqline(wo.d) # adds the 1:1 line to the plot
417
418```
419
420
4213.3.2. Assess each plot for normality, including a justification.
422
423> I believe that both data sets act normal because they follow the 1:1 line as expected and there is almost no variation from the line other then some discrete variables yet that is normal and expected.