· 5 years ago · Jul 15, 2020, 03:16 PM
1---
2title: 02 Chi Distribution
3tags: [DataSci, Notebooks/University/Thinking About Data, Statistics]
4created: '2020-06-15T00:23:19.200Z'
5modified: '2020-06-15T00:23:19.200Z'
6---
7
8> Related: [[Normal-Theory-Chi.md]]
9
10
11
12# Chi Distribution
13
14## Preamble
15
16
17```r
18load.pac <- function() {
19
20 if(require("pacman")){
21 library(pacman)
22 }else{
23 install.packages("pacman")
24 library(pacman)
25 }
26
27 pacman::p_load(xts, sp, gstat, ggplot2, rmarkdown, reshape2, ggmap, wesanderson,
28 parallel, dplyr, plotly, tidyverse, reticulate, UsingR, Rmpfr, latex2exp,
29 mise)
30
31# devtools::install_github("tidyverse/tidyverse")
32}
33
34load.pac()
35```
36
37```
38## Loading required package: pacman
39```
40
41```r
42mise()
43```
44
45## Wellness Data (Difference From Control)
46
47### (01) Enter Data
48
49Now be really careful here, make sure that the column names you choose are:
50
51* [Syntactically Correct](https://rdrr.io/r/base/make.names.html) [^style]
52 + Style Guides recommend lower case, `snake_case` for variable and function names (using nouns in the prior and verbs in the latter), this would include vectors.
53 + Try to avoid using dots, the S3 scheme for defining classes uses `.`'s so you might end up with a confusing method like `as.data.frame.data.frame()`
54 + I'm using camelCase for DF column names in order to distinguish them from variables and make them clearer inside `ggplot` as opposed to `snake_case`.
55* Spelt correctly
56 + You will need matching column names in order to use things like `pivot_longer` etc.
57* Short enough to type in without making a spelling Mistake
58 + Same as above, remenber axis-titles and so forth are distinct from data frame names.
59
60 [^style]: [Style Guide](http://r-pkgs.had.co.nz/style.html)
61
62
63```r
64iraqi = c(123, 70, 93, 157)
65names(iraqi) = c("low", "moderate", "high", "veryHigh")
66head(iraqi)
67```
68
69```
70## low moderate high veryHigh
71## 123 70 93 157
72```
73
74```r
75str(iraqi)
76```
77
78```
79## Named num [1:4] 123 70 93 157
80## - attr(*, "names")= chr [1:4] "low" "moderate" "high" "veryHigh"
81```
82
83### BarPlot
84
85#### Base Plot
86
87
88```r
89barplot(iraqi)
90```
91
92
93
94#### GGPlot2
95
96##### Tidy Data
97
98This can be done in ggplot2, but first a `tidy` data frame needs to be constructed. Tidy data satisfies the following three rules:
99
1001. Each variable will have it's own column
1012. Each ovservation will have it's own row
1023. Each Value will have it's own cell.
103
104A tidy data frame somewhat depends on context, for example:
105
106* If only the refugees were sampled, the data would be structured where:
107 + Observations would be:
108 + Low
109 + Medium
110 + High
111 + Variables would be:
112 + Count
113* If Multiple populations were sampled, for example the Australian population and the iraqi populations, the tidy data frame may be such that:
114 + Observations would be:
115 + Australia
116 + Iraqi
117 + variables would be:
118 + Count
119 + Distress Level
120* If Individuals were classified into a distress category, the corresponding tidy data frame would be:
121 + Observations would be:
122 + Individuals
123 + variables would be:
124 + Country of Origin/Residence
125 + Distress Level
126
127##### Make the Plot
128
129
130```r
131#pivot_longer(data = iraqi, cols = names(iraqi))
132iraqi.tidy <- melt(iraqi, value.name = "Count") %>% as_tibble(rownames = "Distress")
133iraqi.tidy
134```
135
136```
137## # A tibble: 4 x 2
138## Distress Count
139## <chr> <dbl>
140## 1 low 123
141## 2 moderate 70
142## 3 high 93
143## 4 veryHigh 157
144```
145
146```r
147# Base
148#barplot(height = iraqi.tidy$Count, names.arg = iraqi.tidy$Distress)
149
150# GGPlot2
151ggplot(data = iraqi.tidy, mapping = aes(x = Distress, y = Count)) +
152 geom_col()
153```
154
155
156
157##### Fix the Order of the Plot
158In making this plot you may observe that the order of the plot has been made to be alphabetical, the order of the data frame has been disregarded.
159
160This is desirable and expected behaviour, the values of distress have not been correctly encoded, they need to be encoded as ordered factors in order to be ordered, unordered factors may as well be placed in alphabetical order.
161
162
163```r
164iraqi.tidy <- melt(iraqi, value.name = "Count") %>% as_tibble(rownames = "Distress")
165iraqi.tidy$Distress <- factor(iraqi.tidy$Distress, levels = iraqi.tidy$Distress, ordered = TRUE)
166iraqi.tidy
167```
168
169```
170## # A tibble: 4 x 2
171## Distress Count
172## <ord> <dbl>
173## 1 low 123
174## 2 moderate 70
175## 3 high 93
176## 4 veryHigh 157
177```
178
179```r
180# Base
181fillCols <- RColorBrewer::brewer.pal(nrow(iraqi.tidy), name = "Pastel1")
182barplot(height = iraqi.tidy$Count, names.arg = c("Low", "Moderate", "High", "Very High"), col = fillCols, main = "Distress Levels of Iraqi Refugees", xlab = "Distress Level", ylab = "Count")
183```
184
185
186
187```r
188# GGPlot2
189ggplot(data = iraqi.tidy, mapping = aes(x = Distress, y = Count)) +
190 geom_col(mapping = aes(col = Count, fill = Distress)) +
191 theme_classic() +
192 labs(title = "Distress Levels of Iraqi Refugees") +
193 guides(col = FALSE)
194```
195
196
197
198### (02) Enter the AIHW Data
199
200The *Australian Institute of Health and Wellness* data are as follows:
201
202
203```r
204aihw <- c("low" = 70.65, "moderate" = 18.5, "high" = 7.41, "veryHigh" = 3.43)
205aihw.tidy <- tibble::enframe(aihw) # %>% cbind(aihw, iraqi)
206names(aihw.tidy) <- c("Distress", "Count")
207aihw.tidy$Distress <- factor(x = aihw.tidy$Distress, levels = aihw.tidy$Distress, ordered = TRUE)
208aihw.tidy
209```
210
211```
212## # A tibble: 4 x 2
213## Distress Count
214## <ord> <dbl>
215## 1 low 70.6
216## 2 moderate 18.5
217## 3 high 7.41
218## 4 veryHigh 3.43
219```
220
221#### Combine all Observations
222
223Ideally all the data should be combined into a single data set, be mindful that `pivot_longer()` is gonna complain if columns with the same name have different data types, so make sure to remember to re-class categories as factors rather than factors.:
224
225
226```r
227# First add a variable that can be used to distinguish the two Data Sets
228iraqi.tidy$Region <- "Iraq"
229aihw.tidy$Region <- "Australia"
230# Combine the Data Sets
231all.tidy <- rbind(iraqi.tidy, aihw.tidy)
232all.tidy
233```
234
235```
236## # A tibble: 8 x 3
237## Distress Count Region
238## <ord> <dbl> <chr>
239## 1 low 123 Iraq
240## 2 moderate 70 Iraq
241## 3 high 93 Iraq
242## 4 veryHigh 157 Iraq
243## 5 low 70.6 Australia
244## 6 moderate 18.5 Australia
245## 7 high 7.41 Australia
246## 8 veryHigh 3.43 Australia
247```
248
249```r
250all.tidy$Distress <- factor(all.tidy$Distress, levels = iraqi.tidy$Distress, ordered = TRUE)
251
252# Use Pivot Wider in order to make the column names the Region and the variable the count
253all.wide <- pivot_wider(data = all.tidy, names_from = Region, values_from = Count)
254all.wide
255```
256
257```
258## # A tibble: 4 x 3
259## Distress Iraq Australia
260## <ord> <dbl> <dbl>
261## 1 low 123 70.6
262## 2 moderate 70 18.5
263## 3 high 93 7.41
264## 4 veryHigh 157 3.43
265```
266
267#### Plot the aihw Data
268
269
270```r
271# Base Plot
272
273fillCols <- RColorBrewer::brewer.pal(nrow(iraqi.tidy), name = "Pastel2")
274barplot(height = all.wide$Australia, names.arg = c("Low", "Moderate", "High", "Very High"), col = fillCols, main = "Distress Levels of Iraqi Refugees", xlab = "Distress Level", ylab = "Count")
275```
276
277
278
279```r
280## ggplot2
281ggplot(data = all.tidy, mapping = aes(x = Distress, y = Count, fill = Region, col = Count)) +
282 geom_col(position = "dodge") +
283 guides(col = FALSE) +
284 theme_classic() +
285 labs(title = "Distress of Refugees", subtitle = "", y = "Frequency")
286```
287
288
289
290### (03) Determine Expected Frequency
291
292#### Hypothesis
293
2941. $H_0 \enspace : \quad$ The Refugee Distress Categories will have frequencies equal to Australia
2952. $H_a \enspace : \quad$ There will be a difference between the categories
296
297Assuming that the null hypothesis is true, the expected frequency of the categories can be deterimed:
298
299$$
300\textsf{e} = 443 \times \frac{\texttt{aihw}}{100}
301$$
302
303```r
304all.wide$IraqExpected <- all.wide$Australia * (sum(all.wide$Iraq)/100)
305
306# Rename the Column to reflect expected and observed frequencies
307
308## using Dplyer
309all.wide %>%
310 dplyr::rename(
311 IraqObserved = Iraq
312 )
313```
314
315```
316## # A tibble: 4 x 4
317## Distress IraqObserved Australia IraqExpected
318## <ord> <dbl> <dbl> <dbl>
319## 1 low 123 70.6 313.
320## 2 moderate 70 18.5 82.0
321## 3 high 93 7.41 32.8
322## 4 veryHigh 157 3.43 15.2
323```
324
325```r
326## using Base Functions
327names(all.wide)[names(all.wide)=="Iraq"] <- "IraqObserved"
328
329# Print the DataFrame
330all.wide
331```
332
333```
334## # A tibble: 4 x 4
335## Distress IraqObserved Australia IraqExpected
336## <ord> <dbl> <dbl> <dbl>
337## 1 low 123 70.6 313.
338## 2 moderate 70 18.5 82.0
339## 3 high 93 7.41 32.8
340## 4 veryHigh 157 3.43 15.2
341```
342
343
344### (04)Compute the Chi-Squared Distance
345
346The *Chi-Squared* ($\chi^2$) statistic is the squared distance from the the expected and observed values to the expected value:
347
348$$
349\chi^2 = \sum^n_{i=1} \left[ \frac{(o-e)^2}{e} \right]
350$$
351
352This can be done readily in **_R_**:
353
354
355```r
356o <- all.wide$IraqObserved
357e <- all.wide$IraqExpected
358
359all.wide$ChiDist <- ((o-e)^2/e)
360ChiStat <- sum(all.wide$ChiDist)
361ChiStat
362```
363
364```
365## [1] 1550.75
366```
367
368And returns the value $\chi^2 \approx 1551$
369
370### (05) Similate the Counts
371
372A distribution with multiple categories of different probabilities is a **multinomial** distribution and can be simulated:
373
374
375```r
376rmultinom(n = 1, size = 443, prob = aihw/100)
377```
378
379```
380## [,1]
381## low 316
382## moderate 86
383## high 25
384## veryHigh 16
385```
386
387```r
388# A more Rigurous simulation by averaging various other simulations
389average_sim_count <- rmultinom(n = 10000, size = 443, prob = aihw/100) %>% rowMeans()
390all.wide$IraqSimulated <- average_sim_count
391all.wide
392```
393
394```
395## # A tibble: 4 x 6
396## Distress IraqObserved Australia IraqExpected ChiDist IraqSimulated
397## <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
398## 1 low 123 70.6 313. 115. 313.
399## 2 moderate 70 18.5 82.0 1.74 82.0
400## 3 high 93 7.41 32.8 110. 32.9
401## 4 veryHigh 157 3.43 15.2 1323. 15.2
402```
403
404```r
405# Building a Confidence Interval for the true mean given this Distribution
406# The confidence interval is meaningless really, it can be made arbitrarily small by
407# making n sufficietnly large, this is just to illustrate confidence intervals for
408# population means given a sample of sample means.
409sd_sim_count <- rmultinom(n = 1000, size = 443, prob = aihw/100) %>% apply(MARGIN = 1, sd) # 1 is row, 2 is column
410
411# Calculate the t-statistic for 95%
412t <- qt(p = 0.95, df = 999)
413
414data.frame(
415 averageSimCount = average_sim_count,
416 sdSimCount = sd_sim_count,
417 lowerConfidenceForMean= round(average_sim_count - t * (sd_sim_count)/sqrt(1000)),
418 upperConfidenceForMean = round(average_sim_count + t * (sd_sim_count)/sqrt(1000))
419)
420```
421
422```
423## averageSimCount sdSimCount lowerConfidenceForMean upperConfidenceForMean
424## low 312.8818 9.557677 312 313
425## moderate 81.9786 8.321460 82 82
426## high 32.9205 5.451160 33 33
427## veryHigh 15.2191 3.780478 15 15
428```
429
430Or this could be simulated by using the `sample()` function:
431
432
433```r
434frequency <- sample(1:4, 443, replace = TRUE, prob=all.wide$Australia) %>%
435 table %>%
436 tibble::enframe()
437
438names(frequency) <- c("Distress", "Count")
439frequency$Distress <- names(iraqi)
440frequency
441```
442
443```
444## # A tibble: 4 x 2
445## Distress Count
446## <chr> <table>
447## 1 low 307
448## 2 moderate 80
449## 3 high 40
450## 4 veryHigh 16
451```
452
453### (06) Put Everything together
454
455So the idea is, in order to test the hypothesis that there is no difference we will set up a hypothesis test.
456
457The test statistic will be the total squared distance relative to the expected value for the distribution, this is known as the $\chi ^2$ value.
458
459If we can:
460
461* Assume the null hypothesis is true
462 + (no difference between the number of iraqi counts and the number of `aihw` counts)
463* With only 5% probability of getting a false positive under this assumption
464
465then we will reject the null hypothesis that they are the same and accept that there is a significant difference between the distribution of distress in the iraqi population.
466 + (this necessarily show that the iraqi refugees are more distressed, merely that the distribution is different from the Australian distribution in such a way that would be unlikely to be a false positive assuming they were identical.)
467
468In order to evaluate this, We can take a random sample of values that are distributed with the same frequency as the `aihw` data and measure how often the corresponding $\chi^2$ value of the sampled values exceed that of the observed values for the iraqi values, under the assumption that the iraqi values are distributed at the same frequency of the `aihw`, this would amount to a False Positive for a difference.
469
470Creating many chi statistics, comparing them and repeating gives a false positive rate:
471
472
473```r
474# Calcualate the iraqi Chi Stat
475e <- aihw/100 * sum(iraqi)
476o <- iraqi
477iraqi_chi <- sum((e-o)^2/e)
478
479# Simulate the False Positive Rate for data following the Aus Distribution
480n <- 10^3 # Simulation Length
481FalsePosVec <- vector(length = n)
482for (i in 1:n) {
483 # e <- rmultinom(1, 443, aihw/100)
484 e <- sample(1:4, 443, replace = TRUE, prob=aihw) %>% table
485 o <- aihw
486# o <- iraqi
487 sim_chi <- sum((e-o)^2/e )
488 # Assume null hypotheses, this means we assume the iraqi distance does not exceed the sim distance
489 # The number of times it does exceed is:
490 # the probability of a false positive assuming the null hypothesis is correct
491 # this is the p-value.
492 FalsePosQ <- !(iraqi_chi < sim_chi) #(assuming null hypothesis means assuming iraqi distance leq to sim)
493 # The number of positives will be the FPR and indicative of
494 # the p-value)
495 FalsePosVec[i] <- FalsePosQ
496}
497
498pval <- mean(FalsePosVec)
499print(pval)
500```
501
502```
503## [1] 1
504```
505
506This could be made more efficient by using `replicate` rather than a `for` loop (`replicate` is to **_R_** as `Table[]` is to *Mathematica*):
507
508
509```r
510e <- aihw/100 * sum(iraqi)
511o <- iraqi
512iraqi_chi <- sum((e-o)^2/e)
513
514FalsePosCount <- replicate(10^4, expr = {
515 o <- rmultinom(n = 1, size = sum(iraqi), prob = aihw)
516 sim_chi <- sum((e-o)^2/e)
517 !(iraqi_chi < sim_chi)
518})
519
520mean(FalsePosCount)
521```
522
523```
524## [1] 1
525```
526
527This returns a value of 1, indicating that the probability of a false positive, assuming that the data was randomly sampled following the probabilities of the proportions of the Australian populatation, is effectively 1, thus the null hypothesis should be rejected and the alternative hypothesis accepted.
528
529### (07) Use the InBuilt Chi-Square Statistic
530
531The chi-squared value that would correspond to a false positive rate like in the above simulation, may be determined by integrating the appropriate probability density function:
532
533$$
534\begin{aligned}
535f_n\left( x \right)= \frac{1}{2^{\frac{n}{2}} \cdot \Gamma\left( \frac{n}{2} \right)} \cdot x^{\frac{n}{2} - 1}\cdot e^{- \frac{x}{2}}
536\end{aligned}
537$$
538
539where the mean and variance are $n$ and $2n$ respectively; $\Gamma\left( x \right)$ is the gamma function it's very similar to the factorial function ($x!$):
540
541$$\begin{aligned}
542x! &= \Gamma\left( x+ 1 \right)\\
543x! &= x \cdot \Gamma\left( x \right)\\
544\Gamma\left( n \right)&= \left( n- 1 \right)!, \qquad \forall n \in \mathbb{Z} \setminus \mathbb{Z}^- \\
545\Gamma\left( z \right) &= \int_{0}^{\infty}\left( \left( \# \right)^{z- 1}\cdot e^{-\left( \# \right)} \right) \mathrm{d} \left( \# \right), \qquad z \in \mathbb{C} \wedge \Re\left( z \right)>0
546\end{aligned}$$
547
548This doesn't seem quick to solve, plugging it into *Mathematica* gives:
549
550```wolfram
551Integrate[ 1/2^(n/2*Gamma[n/2])*x^(n/2 - 1)*e^(-x/2), {x, 0, \[Infinity]}]
552```
553
554```
555ConditionalExpression[(2^(1/2 - (3 Sqrt[\[Pi]])/4) Sqrt[\[Pi]])/ Log[e]^(3/2), Re[Log[e]] > 0]
556 ```
557
558 $$
559\int^{\infty}_0\frac{1}{2^{\frac{n}{2}} \cdot \Gamma\left( \frac{n}{2} \right)} \cdot x^{\frac{n}{2} - 1}\cdot e^{- \frac{x}{2}} \enspace \mathrm{d}x = \frac{2^{\frac{1}{2}-\frac{3 \sqrt{\pi }}{4}} \sqrt{\pi }}{\log ^{\frac{3}{2}}(e)}
560 $$
561
562Howerver inside **_R_** this is all built into the `pchisq()` function and the null hypothesis may be evaluated without necessarily undertaking the simulation.
563
564#### Evaluate Test Statistic Using Chi Statistic
565
566The probability of a false positive, assuming that the null hypothesis is true can be determined directly from the critical values of the Chi-Statistic.
567
568
569
570```r
571# the null hypothesis is that there is no difference, the
572# probability of detecting a difference will be the upper tail and would be the p-value
573pchisq(q = iraqi_chi, df = (length(aihw)-1), lower.tail = FALSE)
574```
575
576```
577## [1] 0
578```
579
580It isn't even necessary to calculate the $\chi^2$ value, this is built into **_R_** and can be done all at once:
581
582
583```r
584chisq.test(x = iraqi, p = aihw, rescale.p = TRUE)
585```
586
587```
588##
589## Chi-squared test for given probabilities
590##
591## data: iraqi
592## X-squared = 1550.6, df = 3, p-value < 2.2e-16
593```
594
595As opposed to using the $\chi^2$ distribution, it is possible to use a *Monte Carlo* simulation all in one line as well:
596
597
598```r
599chisq.test(x = iraqi, p = aihw, rescale.p = TRUE, simulate.p.value = TRUE, B = 10^3)
600```
601
602```
603##
604## Chi-squared test for given probabilities with simulated p-value (based on 1000 replicates)
605##
606## data: iraqi
607## X-squared = 1550.6, df = NA, p-value = 0.000999
608```
609
610I can't think of any reason to use the monte carlo simulation over the density distribution though
611
612
613
614
615
616
617## Eels Data (Comparison of Two Populations)
618
619### (1) Enter the Data
620
621The data can be entered as a data frame or a matrix, the prior will be better for plotting and visualisation but the latter may be the expected format for various built in functions:
622
623
624```r
625# Create Vectors
626g_moringa <- c("BoRdeR" = 264,"GraSs" = 127,"SaNd" = 99)
627g_vicinus <- c("BoRder" = 161,"GraSs" = 116,"SaNd" = 67)
628 # Capitals to emphasise change later with dimnames
629
630# Create a Matrix
631eel_Mat <- rbind(g_moringa, g_vicinus)
632dimnames(eel_Mat) <- list(species = c("G.moringa", "G.vicinus"),
633 location = c("Border", "Grass", "Sand")
634 )
635
636
637# Create a Data Frame
638eel_DF <- eel_Mat %>%
639 as_tibble() %>%
640 add_column("Species" = factor(rownames(eel_Mat))) %>%
641 dplyr::select(Species, Border, Grass, Sand)
642
643eel_DF # %>% kable()
644```
645
646```
647## # A tibble: 2 x 4
648## Species Border Grass Sand
649## <fct> <dbl> <dbl> <dbl>
650## 1 G.moringa 264 127 99
651## 2 G.vicinus 161 116 67
652```
653
654#### Plot the Data
655
656In order to plot the data a tidy data frame needs to be made using `tidyr::pivot_longer()` or `reshape2::melt()`.
657A custom colour pallet can be specified using the following layers [^ggplotColLayer]:
658
659[^ggplotColLayer]: [ggplot2 colors : How to change colors automatically and manually? - Easy Guides - Wiki - STHDA](http://www.sthda.com/english/wiki/ggplot2-colors-how-to-change-colors-automatically-and-manually)
660
661
662* Discrete Data
663 + `scale_fill_manual()`
664 + to change the fill of the object
665 + `scale_color_manual()`
666 + to change the colour of the **outline** of the object
667 + Using Built in Palletes:
668 + `RColorBrewer`
669 + `scale_fill_brewer(palette="Dark2")`
670 + `scale_color_brewer(palette="Dark2")`
671 + `wesanderson`
672 + `scale_fill_manual(values=wes_palette(n=3, name="GrandBudapest"))`
673 + `scale_color_(values=wes_palette(n=3, name="GrandBudapest"))`
674* Continuous Data
675 + `scale_color_gradient(low="blue", high="red")`
676 + Color is the outline, so think Scatter Plot
677 + To Have a Diverging Pallet:
678 + `scale_color_gradient2`(midpoint=mid, low="blue", mid="white", high="red", space ="Lab" )
679 + To have a pallet of n different colours (in this example 7:
680 + `scale_color_gradientn(colours = rainbow(7))`
681 + `scale_fill_gradient()`
682 + Fill is the filling so think Histogram
683
684scale_color_gradient(low="blue", high="red")
685
686
687
688```r
689# Create a Tidy Data Frame
690
691## Using Pivot Longer from `tidyverse` (dev git repo)
692eel_DF_Tidy <- pivot_longer(data = eel_DF,
693 cols = names(eel_DF[,-1]),
694 names_to = "Location",
695 values_to = "Count")
696
697eel_DF_Tidy$Species <- factor(eel_DF$Species)
698eel_DF_Tidy$Location <- factor(eel_DF_Tidy$Location)
699
700## Using Melt from `reshape2`
701melt(eel_DF, ) %>% dplyr::rename("Location" = variable,
702 "Count" = value)
703```
704
705```
706## Using Species as id variables
707```
708
709```
710## Species Location Count
711## 1 G.moringa Border 264
712## 2 G.vicinus Border 161
713## 3 G.moringa Grass 127
714## 4 G.vicinus Grass 116
715## 5 G.moringa Sand 99
716## 6 G.vicinus Sand 67
717```
718
719```r
720 # Instead of using dplyr I could have used `variable.name=`...,
721 # just done for reference
722
723## ggplot2
724violetBluePallet <- c("#511FB5", "dodgerblue3", "#e31a1c")
725ggplot(data = eel_DF_Tidy,
726 mapping = aes(x = Location,
727 y = Count,
728 fill = Species,
729 col = Count)) +
730 geom_col(position = "dodge") +
731 guides(col = FALSE) +
732 theme_classic() +
733 labs(title = "blah", subtitle = "", y = "Frequency") +
734 scale_fill_manual(values=violetBluePallet)
735```
736
737
738
739
740### (2) Expected Values
741
742In this case the two hypothesis are:
743
744* $H_0: \quad$ the two species are distriuted with the same frequency
745 + So the proportion in both areas is assumed to be equal
746 + Given this assumption we may take the proportion of the total number
747 of observations that occur in this area and this will be equal to averaging the proportions
748 of each species that occur in the two areas.
749* $H_a: \quad$ there is a difference between the distribution of the two frequencies
750
751
752
753Under the assumption that both species have the same distribution (i.e. assume $\mathrm{H}_0$ is true) each term $x_{ij}$ will have an expected frequency of $f = \frac{1}{n} \cdot \sum^{2}_{i= 1} \left[x_i \right]$ and hence the expected count would be the frequency multiplied by the total number of observed species:
754
755$$\begin{aligned}
756 x_{ij}&= f \cdot \sum^{3}_{j= 1} \left[ x_{ij} \right] \\
757&= \frac{1}{n} \cdot \sum^{2}_{i= 1} \left[x_i \right] \cdot \sum^{3}_{j= 1} \left[ x_{ij} \right]
758\end{aligned}$$
759
760so the resulting matrix of counts would be:
761
762$$\begin{aligned}
763 \begin{bmatrix} 490 \\ 344 \end{bmatrix}
764 \times
765 \begin{bmatrix} 0.5 & 0.29 & 0.2 \end{bmatrix} \\
766 = \begin{bmatrix} 250 & 143 & 98 \\
767 175 & 100 & 68 \end{bmatrix}
768\end{aligned}$$
769
770
771Assuming that the null hypothesis is true, the expected distribution between areas could be calculated by using matrix multiplcication:
772
773
774```r
775species_counts <- rowSums(eel_Mat)
776location_proportions <- colSums(eel_Mat)/sum(eel_Mat)
777
778# Now Perform matrix Multiplication
779 species_counts
780```
781
782```
783## G.moringa G.vicinus
784## 490 344
785```
786
787```r
788 location_proportions
789```
790
791```
792## Border Grass Sand
793## 0.5095923 0.2913669 0.1990408
794```
795
796```r
797 as.matrix(species_counts) %*% t(as.matrix(location_proportions))
798```
799
800```
801## Border Grass Sand
802## G.moringa 249.7002 142.7698 97.52998
803## G.vicinus 175.2998 100.2302 68.47002
804```
805
806This is actually the definition of the outer product; the [*Outer Product*](https://en.wikipedia.org/wiki/Outer_product#Definition) is defined as:
807
808
809
810$$
811\mathbf{u} \otimes \mathbf {v} =\mathbf {u} \mathbf {v} ^{\textsf {T}}={\begin{bmatrix}u_{1}\\u_{2}\\u_{3}\\u_{4}\end{bmatrix}}{\begin{bmatrix}v_{1}&v_{2}&v_{3}\end{bmatrix}}={\begin{bmatrix}u_{1}v_{1}&u_{1}v_{2}&u_{1}v_{3}\\u_{2}v_{1}&u_{2}v_{2}&u_{2}v_{3}\\u_{3}v_{1}&u_{3}v_{2}&u_{3}v_{3}\\u_{4}v_{1}&u_{4}v_{2}&u_{4}v_{3}\end{bmatrix}}.
812$$
813In **_R_** vectors [^vecspace] of length $m$ are treated as $m \times 1$ matrices as can be observed by evaluating `as.matrix(1:3)` and `t(as.matrix(1:3))`, this means that the outer product of two vectors will be equivalent to:
814
815
816$$
817\mathbf {u} \otimes \mathbf {v} =\mathbf {A} ={\begin{bmatrix}u_{1}v_{1}&u_{1}v_{2}&\dots &u_{1}v_{n}\\u_{2}v_{1}&u_{2}v_{2}&\dots &u_{2}v_{n}\\\vdots &\vdots &\ddots &\vdots \\u_{m}v_{1}&u_{m}v_{2}&\dots &u_{m}v_{n}\end{bmatrix}}
818$$
819
820
821
822
823[^vecspace]: In this context a matrix is vector in the sense that matrices can be used to satisfy the axiom of vector addition for a Vector Space, refer to 4.2 of R Larson's *Linear Algebra* 7th ed.
824
825
826
827So the expected occurence rate of the species would be:
828
829
830```r
831# Determine how many species there are
832species_counts <- rowSums(eel_Mat)
833species_proportions <- rowSums(eel_Mat)/n
834# Determine the area proportions
835location_proportions <- colSums(eel_Mat)/sum(eel_Mat)
836
837# Calculate the expected distribution of that number for those proportions
838expected_counts <- base::outer(species_counts, location_proportions)
839
840print(list(species_counts, location_proportions, expected_counts, base::outer(location_proportions, species_counts) ))
841```
842
843```
844## [[1]]
845## G.moringa G.vicinus
846## 490 344
847##
848## [[2]]
849## Border Grass Sand
850## 0.5095923 0.2913669 0.1990408
851##
852## [[3]]
853## Border Grass Sand
854## G.moringa 249.7002 142.7698 97.52998
855## G.vicinus 175.2998 100.2302 68.47002
856##
857## [[4]]
858## G.moringa G.vicinus
859## Border 249.70024 175.29976
860## Grass 142.76978 100.23022
861## Sand 97.52998 68.47002
862```
863
864or if you're willing to remember that:
865
866$$
867e_{ij} = \frac{\sum{[\textsf{row}]} * \sum{[\textsf{col}]}}{n}
868$$
869
870
871```r
872e <- matrix(1:6, nrow = 2)
873 for (i in 1:nrow(eel_Mat)) {
874 for (j in 1:ncol(eel_Mat)) {
875 e[i,j] <- colSums(eel_Mat)[j] * rowSums(eel_Mat)[i] / n
876 }
877 }
878```
879
880
881
882### (3) Simulate The Values
883
884The game plan here is to:
885
8861. Assume that the null hypothesis is true:
887 + The observations are distributed equally across features:
888 + $e_{ij} = \sum \left[ \textsf{rows} \times \right] \sum \left[ \textsf{cols} \right] \times \frac{1}{n}$
8892. Randomly sample values at the same probability
890 + given this simulated observation, determine what the expected
891 distribution would be determined to be assuming that the null
892 hypothesis was true.
893 + Determine what the corresponding $\chi^2$ value is.
894 + the number of times that this simulated $\chi^2$ value is
895 greater than the observed $\chi^2$ value is
896 the _**F**alse **P**ositive **R**ate_
897
898#### Sample a Single Value
899
900In order to simulate the values we need simulate data distributed at given probabilities, this is known as a multinomial distribution, it's essentially rolling a really oddly lopsided die that matches the probabilities specified.
901
902From that sample it is necessary to calculate what we would determine the expected distribution to be assuming that the null hypothesis was true:
903
904
905```r
906overall_proportion <- expected_counts/sum(expected_counts)
907
908# Presuming that R's internal structure is consistent
909rmultinom(n = 1, size = sum(eel_Mat), prob = overall_proportion) %>% matrix(ncol = 3, nrow = 2)
910```
911
912```
913## [,1] [,2] [,3]
914## [1,] 229 153 96
915## [2,] 184 102 70
916```
917
918```r
919# Presuming it's not
920dist_prob <- overall_proportion %>% as.vector
921obs_sim <- rmultinom(n = 1, size = sum(eel_Mat), prob = dist_prob) %>% matrix(ncol = 3, nrow = 2)
922e_sim <- 1/n*outer(X = rowSums(obs_sim), colSums(obs_sim))
923
924print(list(obs_sim, e_sim), 1)
925```
926
927```
928## [[1]]
929## [,1] [,2] [,3]
930## [1,] 228 144 107
931## [2,] 193 99 63
932##
933## [[2]]
934## [,1] [,2] [,3]
935## [1,] 202 116 81
936## [2,] 149 86 60
937```
938
939```r
940## Sanity Check
941#1:6 %>% matrix(nrow = 2, ncol =3) %>% as.vector()
942#1:6 %>% as.matrix() %>% as.vector() %>% matrix(nrow = 2, ncol =3)
943#1:6 %>% as.matrix() %>% as.vector() %>% matrix(nrow = 2, ncol =3) %>% as.vector()
944#1:6 %>% as.matrix() %>% as.vector() %>% matrix(nrow = 2, ncol =3) %>% as.vector() %>% matrix(nrow = 2, ncol =3)
945```
946
947If this was repeated many times over, the number of times that the $\chi^2$ statistic was sufficiently extreme to reject the null hypothesis would represent the false positive rate, which would be an acceptable estimate for the probability of a type I error, the $p$-value:
948
949> The probability of rejecting the null hypothesis under the assumption that it is true (i.e. under the assumption that there is no true effect). Careful, this is different from the false discovery rate
950
951This simulation is under the assumption that the null hypothesis is true and that the two populations are distributed equally, so the null hypothesis assumes that:
952
953$$
954\begin{aligned}
955&\mathrm{H}_0:\quad \chi^2_{obs} < \chi^2_{sim} \\
956\end{aligned}
957$$
958
959A False Positive would be an observation that violates that assumption, if the probability of a false positive, the $p$ -value:
960
961* Sufficiently small, the null hypothesis will be rejected.
962* too high the null hypothesis will not be rejected.
963
964#### Simulate Samples of that frequency
965
966Calculate the Chi Distribution for the observations which will become the test statistic:
967
968
969```r
970# Create expected and observed vectors
971e <- expected_counts
972o <- eel_Mat
973n <- sum(eel_Mat)
974
975chi_obs <- sum((e-o)^2/e)
976
977#obs_sim <- rmultinom(n = 1, size = sum(eel_Mat), prob = dist_prob) %>% matrix(ncol = 3, nrow = 2)
978```
979
980
981The idea of the simulation is to generate observations at the proportion assumed by the null hypothesis and reduce these matrices to corresponding $\chi^2$ values.
982
983
984Simulate samples at the same proportion and sample the $\chi^2$ statistic:
985
986
987```r
988# Simulate distribution
989 # Use `Replicate` not `for` because it's faster
990
991dist_prob <- overall_proportion %>% as.vector
992 # I could also have used rmultinom to sample the split of the population across two species
993 # Then split those species amont locations
994 # or made two samples of the species and location dist and then used `table()`
995
996sim_chi_vec <- replicate(10^4, {
997
998 ## Simulate Samples
999obs_sim <- rmultinom(n = 1, size = sum(eel_Mat), prob = dist_prob) %>% matrix(ncol = 3, nrow = 2)
1000
1001 ## Calculate the expected values from the sample
1002 ## Assuming the null hypothesis that both rows are equal.
1003e <- outer(rowSums(obs_sim), colSums(obs_sim))/n
1004
1005 ## Calculate the Chi Squared Statistic
1006sim_chi <- sum((e-obs_sim)^2/e)
1007sim_chi
1008
1009})
1010```
1011
1012If a simulated distribution had a $\chi^2$ value more extreme than the observation, the null hypothesis would be rejected, the simulation was generated under circumstanes where the null hypothesis was true and so this would be a false positive or a *Type I Error*.
1013
1014The rate of false positives is an estimator for the probability of commiting a Type I error (the $p$ -value), this can be calculated:
1015
1016
1017
1018```r
1019calculate_p_value <- function() {
1020 mean(falsepos())
1021}
1022
1023 falsepos <- function() {
1024 # If the critical value was our observation, would the simulation be less extreme?
1025 # If not this is a false positive
1026 # If the simulated value is more extreme than the
1027 !(sim_chi_vec > chi_obs)
1028 sim_chi_vec > chi_obs
1029 }
1030
1031p <- calculate_p_value()
1032
1033paste("The Probability of rejecting the null hypothesis, assuming that it was true (i.e. detecting a false positive assuming there is no difference between species) is", p) %>% print()
1034```
1035
1036```
1037## [1] "The Probability of rejecting the null hypothesis, assuming that it was true (i.e. detecting a false positive assuming there is no difference between species) is 0.0403"
1038```
1039
1040```r
1041"This is too high and so the null hypothesis is not rejected." %>% print()
1042```
1043
1044```
1045## [1] "This is too high and so the null hypothesis is not rejected."
1046```
1047
1048Hence the probability of rejecting the null hypothesis when it is true is quite small, the $p$ -value is less than 5% and so the null hypothesis is rejected and the alternative hypothesis, that the species are distributed differently is accepted.
1049
1050This can be visualised in a histogram:
1051
1052
1053
1054```r
1055# Plot a Histogram
1056
1057 ## Base Plot
1058 myhist <- hist(sim_chi_vec, breaks = 50)
1059```
1060
1061
1062
1063```r
1064 plot(myhist, col = ifelse(myhist$mid<6, "white", "lightblue"),
1065 main= latex2exp::TeX("Simulated $\\chi^2$ Distances"),
1066 freq = FALSE,
1067 xlab = TeX("$\\chi^2$ distance"))
1068 abline(v = chi_obs, col = "Indianred", lty = 2, lwd = 3)
1069```
1070
1071
1072
1073```r
1074 # Conditional Colour: # https://stat.ethz.ch/pipermail/r-help/2008-July/167936.html
1075
1076 ## GGPlot2
1077
1078 ### Make a Tidy Data Frame
1079 chi_vals_Tib <- tibble::enframe(sim_chi_vec)
1080 chi_vals_Tib$name[sim_chi_vec<chi_obs] <- "TruePos"
1081 chi_vals_Tib$name[sim_chi_vec>chi_obs] <- "FalsePos"
1082
1083 ### Plot the Data
1084 ggplot(data = chi_vals_Tib, mapping = aes(x = value))+
1085 geom_histogram(binwidth = 0.4, col = "white", aes(fill = name)) +
1086 theme_classic() +
1087 guides(fill = guide_legend("Conclustion\nStatus")) +
1088 geom_vline(xintercept = chi_obs, lty = 3, col = "purple") +
1089 scale_fill_manual(values = c("indianred", "lightblue"),
1090 labels = c("False Positive / Type I", "True Positive")
1091 ) +
1092 labs(title = latex2exp::TeX("Simulated $\\chi^2$ Values")) +
1093 scale_x_continuous(limits = c(0, 11.5))
1094```
1095
1096```
1097## Warning: Removed 32 rows containing non-finite values (stat_bin).
1098```
1099
1100```
1101## Warning: Removed 4 rows containing missing values (geom_bar).
1102```
1103
1104
1105
1106
1107
1108### (4) Use the Chi Squared Distribution
1109
1110The $p$-value is a function of:
1111
1112* The degrees of freedom
1113 + $\mathsf{d.f. = (\mathsf{rows}-1) \times (\mathsf{cols}-1)}$
1114* The $\chi^2$ value
1115
1116So the probability of rejecting the null hypothesis, under the assumption that it is true can be determined using a predefined function.
1117
1118#### Monte Carlo Simulation
1119
1120
1121```r
1122chisq.test(x = eel_Mat, simulate.p.value = TRUE, B = 10^4)
1123```
1124
1125```
1126##
1127## Pearson's Chi-squared test with simulated p-value (based on 10000 replicates)
1128##
1129## data: eel_Mat
1130## X-squared = 6.2621, df = NA, p-value = 0.0459
1131```
1132
1133#### Analytic Function
1134
1135```r
1136chisq.test(eel_Mat)
1137```
1138
1139```
1140##
1141## Pearson's Chi-squared test
1142##
1143## data: eel_Mat
1144## X-squared = 6.2621, df = 2, p-value = 0.04367
1145```
1146
1147### Summary
1148
1149To determine whether or not the two species are distributed differently:
1150
1151
1152```r
1153matrix(c(264, 161, 127, 116, 99, 67), nrow = 2) %>% chisq.test()
1154```
1155
1156```
1157##
1158## Pearson's Chi-squared test
1159##
1160## data: .
1161## X-squared = 6.2621, df = 2, p-value = 0.04367
1162```
1163
1164
1165## Other Examples
1166
1167### (1) Daily Frequency
1168
1169ABC bank has determined the following counts of ATM use, is there any evidence to suggest that the spread is not equal?
1170
1171| Mon | Tues | Wed | Thur | Fri |
1172| --- | ---| --- | --- | --- |
1173| 253 | 197 | 204 | 279 | 267 |
1174
1175#### Hypothesis
1176
11771. $H_0:\quad x_1 = x_2 = ... x_5$
11782. $H_a:\quad x_i \neq x_j \enspace \exists i,j \in \left\{1,2,3,4,5\right\}$
1179
1180#### Rejection Region
1181
1182
1183```r
1184atm_usage <- c(253, 197, 204, 279, 267)
1185chisq.test(x = atm_usage)
1186```
1187
1188```
1189##
1190## Chi-squared test for given probabilities
1191##
1192## data: atm_usage
1193## X-squared = 23.183, df = 4, p-value = 0.0001164
1194```
1195
1196```r
1197chisq.test(x = atm_usage, simulate.p.value = TRUE, B = 10^3)
1198```
1199
1200```
1201##
1202## Chi-squared test for given probabilities with simulated p-value (based on 1000 replicates)
1203##
1204## data: atm_usage
1205## X-squared = 23.183, df = NA, p-value = 0.000999
1206```
1207
1208The $p$ -value is less than 5% and so the null hypothesis is rejected and it is concluded that the usage differs between days of usage.
1209
1210
1211### (2) Mendellian Genetics
1212
1213Clasp you hands together. Which thumb is on top? Everyone has two copies of a gene which
1214determines which thumb is most comfortable on top, the variants can be labelled L and R, an
1215individual is either LL, LR, or RR. Counts of 65 children found,
1216
1217| LL | LR | RR |
1218|----|----|----|
1219| 14 | 31 | 20 |
1220
1221According to Mendelian genetics 25% should be LL, 50% LR and 25% RR.
1222Use chisq.test to see if the data is consistent with this hypothesis.
1223
1224#### Hypotheses
1225
12261. $H_0:\quad \mathsf{LL}=\mathsf{RR}=0.25; \enspace \mathsf{LR}=0.5$
12272. $H_a:\quad$ The distribution differs from 0.25, 0.25, 0.5.
1228
1229#### Rejection Region
1230
1231In order to deal with this type of hypothesis, it is necessary to pass the probabilities to the function as a seperate argument:
1232
1233
1234```r
1235gene_hand <- c(14, 31, 20)
1236chisq.test(gene_hand, p = c(0.25, 0.5, 0.25))
1237```
1238
1239```
1240##
1241## Chi-squared test for given probabilities
1242##
1243## data: gene_hand
1244## X-squared = 1.2462, df = 2, p-value = 0.5363
1245```
1246
1247The $p$ -value is 50%, this indicates that the probability of rejecting the null hypothesis, if it was true and hence commiting a Type I error is quite high.
1248
1249Hence the null hypothesis is not rejected and it is concluded that:
1250
1251* There is insufficient evidence to reject the hypothesis that the hand usage are distributed in a way consistent with *Mendellian* genetics.
1252
1253
1254### (3) Political Opinion
1255300 adults were asked whether school teachers should be given more freedom to punish unruly
1256students. The following results were obtained?
1257
1258| | In Favour | Against | No Opinion |
1259| --- | --- | --- | --- |
1260| Men | 93 | 70 | 12 |
1261| Women | 87 | 32 | 6 |
1262
1263
1264Do men and women have the same distribution of opinions?
1265
1266#### Rejection Region
1267
1268By default a Chi-Squared test in **_R_** will compare whether or not rows (observations) in a matrix have the same distribution.
1269
1270This is distinct from Question 2 above where the question was:
1271
1272* Is the observation distributed at the speciefied frequency
1273
1274In this case the question is:
1275
1276* Are the two cases distributed with the same frequency
1277 * Assuming that they are means that the true distribution will be the mean value of the two observations.
1278
1279
1280```r
1281opinion_punish <- matrix(c(93, 87, 70, 32, 12, 6), nrow = 2)
1282
1283chisq.test(opinion_punish)
1284```
1285
1286```
1287##
1288## Pearson's Chi-squared test
1289##
1290## data: opinion_punish
1291## X-squared = 8.2528, df = 2, p-value = 0.01614
1292```
1293
1294The $p$ -value is 2% indicating the probability of rejecting the null-hypothesis if it were true is quite low, hence the null hypothesis is rejected and it is concluded that men and women have differing distributions of opinions.
1295
1296### (4) Medical Efficacy
1297
1298Two drugs are administered to patients to treat the same disease.
1299
1300| | Cured | Not Cured |
1301| --- | --- | --- |
1302| Drug A | 44 | 16 |
1303| Drug B | 18 | 22 |
1304
1305Are the drugs equally effective?
1306
1307#### Rejection Region
1308
1309This is the same as the eels or men/women problem, the drugs are rows / observations of different classes and the columns will be the outcome of the treatment:
1310
1311
1312
1313```r
1314drug_effect <- matrix(c(44, 18, 16, 22), nrow = 2)
1315rownames(drug_effect) <- c("Drug A", "Drug B")
1316colnames(drug_effect) <- c("Cured", "Not Cured")
1317
1318chisq.test(drug_effect)
1319```
1320
1321```
1322##
1323## Pearson's Chi-squared test with Yates' continuity correction
1324##
1325## data: drug_effect
1326## X-squared = 7.0193, df = 1, p-value = 0.008064
1327```
1328
1329The p-value is <1% indicating that the probability of rejecting the null hypothesis, under the assumption that it is true, is very small.
1330
1331Hence the null hypothesis is rejected and it is concluded that the drugs differ in effectiveness.
1332```
1333---
1334title: 02 Chi Distribution
1335tags: [DataSci, Notebooks/University/Thinking About Data, Statistics]
1336created: '2020-06-15T00:23:19.200Z'
1337modified: '2020-06-15T00:23:19.200Z'
1338---
1339
1340> Related: [[Normal-Theory-Chi.md]]
1341
1342
1343
1344# Chi Distribution
1345
1346## Preamble
1347
1348
1349```r
1350load.pac <- function() {
1351
1352 if(require("pacman")){
1353 library(pacman)
1354 }else{
1355 install.packages("pacman")
1356 library(pacman)
1357 }
1358
1359 pacman::p_load(xts, sp, gstat, ggplot2, rmarkdown, reshape2, ggmap, wesanderson,
1360 parallel, dplyr, plotly, tidyverse, reticulate, UsingR, Rmpfr, latex2exp,
1361 mise)
1362
1363# devtools::install_github("tidyverse/tidyverse")
1364}
1365
1366load.pac()
1367```
1368
1369```
1370## Loading required package: pacman
1371```
1372
1373```r
1374mise()
1375```
1376
1377## Wellness Data (Difference From Control)
1378
1379### (01) Enter Data
1380
1381Now be really careful here, make sure that the column names you choose are:
1382
1383* [Syntactically Correct](https://rdrr.io/r/base/make.names.html) [^style]
1384 + Style Guides recommend lower case, `snake_case` for variable and function names (using nouns in the prior and verbs in the latter), this would include vectors.
1385 + Try to avoid using dots, the S3 scheme for defining classes uses `.`'s so you might end up with a confusing method like `as.data.frame.data.frame()`
1386 + I'm using camelCase for DF column names in order to distinguish them from variables and make them clearer inside `ggplot` as opposed to `snake_case`.
1387* Spelt correctly
1388 + You will need matching column names in order to use things like `pivot_longer` etc.
1389* Short enough to type in without making a spelling Mistake
1390 + Same as above, remenber axis-titles and so forth are distinct from data frame names.
1391
1392 [^style]: [Style Guide](http://r-pkgs.had.co.nz/style.html)
1393
1394
1395```r
1396iraqi = c(123, 70, 93, 157)
1397names(iraqi) = c("low", "moderate", "high", "veryHigh")
1398head(iraqi)
1399```
1400
1401```
1402## low moderate high veryHigh
1403## 123 70 93 157
1404```
1405
1406```r
1407str(iraqi)
1408```
1409
1410```
1411## Named num [1:4] 123 70 93 157
1412## - attr(*, "names")= chr [1:4] "low" "moderate" "high" "veryHigh"
1413```
1414
1415### BarPlot
1416
1417#### Base Plot
1418
1419
1420```r
1421barplot(iraqi)
1422```
1423
1424
1425
1426#### GGPlot2
1427
1428##### Tidy Data
1429
1430This can be done in ggplot2, but first a `tidy` data frame needs to be constructed. Tidy data satisfies the following three rules:
1431
14321. Each variable will have it's own column
14332. Each ovservation will have it's own row
14343. Each Value will have it's own cell.
1435
1436A tidy data frame somewhat depends on context, for example:
1437
1438* If only the refugees were sampled, the data would be structured where:
1439 + Observations would be:
1440 + Low
1441 + Medium
1442 + High
1443 + Variables would be:
1444 + Count
1445* If Multiple populations were sampled, for example the Australian population and the iraqi populations, the tidy data frame may be such that:
1446 + Observations would be:
1447 + Australia
1448 + Iraqi
1449 + variables would be:
1450 + Count
1451 + Distress Level
1452* If Individuals were classified into a distress category, the corresponding tidy data frame would be:
1453 + Observations would be:
1454 + Individuals
1455 + variables would be:
1456 + Country of Origin/Residence
1457 + Distress Level
1458
1459##### Make the Plot
1460
1461
1462```r
1463#pivot_longer(data = iraqi, cols = names(iraqi))
1464iraqi.tidy <- melt(iraqi, value.name = "Count") %>% as_tibble(rownames = "Distress")
1465iraqi.tidy
1466```
1467
1468```
1469## # A tibble: 4 x 2
1470## Distress Count
1471## <chr> <dbl>
1472## 1 low 123
1473## 2 moderate 70
1474## 3 high 93
1475## 4 veryHigh 157
1476```
1477
1478```r
1479# Base
1480#barplot(height = iraqi.tidy$Count, names.arg = iraqi.tidy$Distress)
1481
1482# GGPlot2
1483ggplot(data = iraqi.tidy, mapping = aes(x = Distress, y = Count)) +
1484 geom_col()
1485```
1486
1487
1488
1489##### Fix the Order of the Plot
1490In making this plot you may observe that the order of the plot has been made to be alphabetical, the order of the data frame has been disregarded.
1491
1492This is desirable and expected behaviour, the values of distress have not been correctly encoded, they need to be encoded as ordered factors in order to be ordered, unordered factors may as well be placed in alphabetical order.
1493
1494
1495```r
1496iraqi.tidy <- melt(iraqi, value.name = "Count") %>% as_tibble(rownames = "Distress")
1497iraqi.tidy$Distress <- factor(iraqi.tidy$Distress, levels = iraqi.tidy$Distress, ordered = TRUE)
1498iraqi.tidy
1499```
1500
1501```
1502## # A tibble: 4 x 2
1503## Distress Count
1504## <ord> <dbl>
1505## 1 low 123
1506## 2 moderate 70
1507## 3 high 93
1508## 4 veryHigh 157
1509```
1510
1511```r
1512# Base
1513fillCols <- RColorBrewer::brewer.pal(nrow(iraqi.tidy), name = "Pastel1")
1514barplot(height = iraqi.tidy$Count, names.arg = c("Low", "Moderate", "High", "Very High"), col = fillCols, main = "Distress Levels of Iraqi Refugees", xlab = "Distress Level", ylab = "Count")
1515```
1516
1517
1518
1519```r
1520# GGPlot2
1521ggplot(data = iraqi.tidy, mapping = aes(x = Distress, y = Count)) +
1522 geom_col(mapping = aes(col = Count, fill = Distress)) +
1523 theme_classic() +
1524 labs(title = "Distress Levels of Iraqi Refugees") +
1525 guides(col = FALSE)
1526```
1527
1528
1529
1530### (02) Enter the AIHW Data
1531
1532The *Australian Institute of Health and Wellness* data are as follows:
1533
1534
1535```r
1536aihw <- c("low" = 70.65, "moderate" = 18.5, "high" = 7.41, "veryHigh" = 3.43)
1537aihw.tidy <- tibble::enframe(aihw) # %>% cbind(aihw, iraqi)
1538names(aihw.tidy) <- c("Distress", "Count")
1539aihw.tidy$Distress <- factor(x = aihw.tidy$Distress, levels = aihw.tidy$Distress, ordered = TRUE)
1540aihw.tidy
1541```
1542
1543```
1544## # A tibble: 4 x 2
1545## Distress Count
1546## <ord> <dbl>
1547## 1 low 70.6
1548## 2 moderate 18.5
1549## 3 high 7.41
1550## 4 veryHigh 3.43
1551```
1552
1553#### Combine all Observations
1554
1555Ideally all the data should be combined into a single data set, be mindful that `pivot_longer()` is gonna complain if columns with the same name have different data types, so make sure to remember to re-class categories as factors rather than factors.:
1556
1557
1558```r
1559# First add a variable that can be used to distinguish the two Data Sets
1560iraqi.tidy$Region <- "Iraq"
1561aihw.tidy$Region <- "Australia"
1562# Combine the Data Sets
1563all.tidy <- rbind(iraqi.tidy, aihw.tidy)
1564all.tidy
1565```
1566
1567```
1568## # A tibble: 8 x 3
1569## Distress Count Region
1570## <ord> <dbl> <chr>
1571## 1 low 123 Iraq
1572## 2 moderate 70 Iraq
1573## 3 high 93 Iraq
1574## 4 veryHigh 157 Iraq
1575## 5 low 70.6 Australia
1576## 6 moderate 18.5 Australia
1577## 7 high 7.41 Australia
1578## 8 veryHigh 3.43 Australia
1579```
1580
1581```r
1582all.tidy$Distress <- factor(all.tidy$Distress, levels = iraqi.tidy$Distress, ordered = TRUE)
1583
1584# Use Pivot Wider in order to make the column names the Region and the variable the count
1585all.wide <- pivot_wider(data = all.tidy, names_from = Region, values_from = Count)
1586all.wide
1587```
1588
1589```
1590## # A tibble: 4 x 3
1591## Distress Iraq Australia
1592## <ord> <dbl> <dbl>
1593## 1 low 123 70.6
1594## 2 moderate 70 18.5
1595## 3 high 93 7.41
1596## 4 veryHigh 157 3.43
1597```
1598
1599#### Plot the aihw Data
1600
1601
1602```r
1603# Base Plot
1604
1605fillCols <- RColorBrewer::brewer.pal(nrow(iraqi.tidy), name = "Pastel2")
1606barplot(height = all.wide$Australia, names.arg = c("Low", "Moderate", "High", "Very High"), col = fillCols, main = "Distress Levels of Iraqi Refugees", xlab = "Distress Level", ylab = "Count")
1607```
1608
1609
1610
1611```r
1612## ggplot2
1613ggplot(data = all.tidy, mapping = aes(x = Distress, y = Count, fill = Region, col = Count)) +
1614 geom_col(position = "dodge") +
1615 guides(col = FALSE) +
1616 theme_classic() +
1617 labs(title = "Distress of Refugees", subtitle = "", y = "Frequency")
1618```
1619
1620
1621
1622### (03) Determine Expected Frequency
1623
1624#### Hypothesis
1625
16261. $H_0 \enspace : \quad$ The Refugee Distress Categories will have frequencies equal to Australia
16272. $H_a \enspace : \quad$ There will be a difference between the categories
1628
1629Assuming that the null hypothesis is true, the expected frequency of the categories can be deterimed:
1630
1631$$
1632\textsf{e} = 443 \times \frac{\texttt{aihw}}{100}
1633$$
1634
1635```r
1636all.wide$IraqExpected <- all.wide$Australia * (sum(all.wide$Iraq)/100)
1637
1638# Rename the Column to reflect expected and observed frequencies
1639
1640## using Dplyer
1641all.wide %>%
1642 dplyr::rename(
1643 IraqObserved = Iraq
1644 )
1645```
1646
1647```
1648## # A tibble: 4 x 4
1649## Distress IraqObserved Australia IraqExpected
1650## <ord> <dbl> <dbl> <dbl>
1651## 1 low 123 70.6 313.
1652## 2 moderate 70 18.5 82.0
1653## 3 high 93 7.41 32.8
1654## 4 veryHigh 157 3.43 15.2
1655```
1656
1657```r
1658## using Base Functions
1659names(all.wide)[names(all.wide)=="Iraq"] <- "IraqObserved"
1660
1661# Print the DataFrame
1662all.wide
1663```
1664
1665```
1666## # A tibble: 4 x 4
1667## Distress IraqObserved Australia IraqExpected
1668## <ord> <dbl> <dbl> <dbl>
1669## 1 low 123 70.6 313.
1670## 2 moderate 70 18.5 82.0
1671## 3 high 93 7.41 32.8
1672## 4 veryHigh 157 3.43 15.2
1673```
1674
1675
1676### (04)Compute the Chi-Squared Distance
1677
1678The *Chi-Squared* ($\chi^2$) statistic is the squared distance from the the expected and observed values to the expected value:
1679
1680$$
1681\chi^2 = \sum^n_{i=1} \left[ \frac{(o-e)^2}{e} \right]
1682$$
1683
1684This can be done readily in **_R_**:
1685
1686
1687```r
1688o <- all.wide$IraqObserved
1689e <- all.wide$IraqExpected
1690
1691all.wide$ChiDist <- ((o-e)^2/e)
1692ChiStat <- sum(all.wide$ChiDist)
1693ChiStat
1694```
1695
1696```
1697## [1] 1550.75
1698```
1699
1700And returns the value $\chi^2 \approx 1551$
1701
1702### (05) Similate the Counts
1703
1704A distribution with multiple categories of different probabilities is a **multinomial** distribution and can be simulated:
1705
1706
1707```r
1708rmultinom(n = 1, size = 443, prob = aihw/100)
1709```
1710
1711```
1712## [,1]
1713## low 316
1714## moderate 86
1715## high 25
1716## veryHigh 16
1717```
1718
1719```r
1720# A more Rigurous simulation by averaging various other simulations
1721average_sim_count <- rmultinom(n = 10000, size = 443, prob = aihw/100) %>% rowMeans()
1722all.wide$IraqSimulated <- average_sim_count
1723all.wide
1724```
1725
1726```
1727## # A tibble: 4 x 6
1728## Distress IraqObserved Australia IraqExpected ChiDist IraqSimulated
1729## <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
1730## 1 low 123 70.6 313. 115. 313.
1731## 2 moderate 70 18.5 82.0 1.74 82.0
1732## 3 high 93 7.41 32.8 110. 32.9
1733## 4 veryHigh 157 3.43 15.2 1323. 15.2
1734```
1735
1736```r
1737# Building a Confidence Interval for the true mean given this Distribution
1738# The confidence interval is meaningless really, it can be made arbitrarily small by
1739# making n sufficietnly large, this is just to illustrate confidence intervals for
1740# population means given a sample of sample means.
1741sd_sim_count <- rmultinom(n = 1000, size = 443, prob = aihw/100) %>% apply(MARGIN = 1, sd) # 1 is row, 2 is column
1742
1743# Calculate the t-statistic for 95%
1744t <- qt(p = 0.95, df = 999)
1745
1746data.frame(
1747 averageSimCount = average_sim_count,
1748 sdSimCount = sd_sim_count,
1749 lowerConfidenceForMean= round(average_sim_count - t * (sd_sim_count)/sqrt(1000)),
1750 upperConfidenceForMean = round(average_sim_count + t * (sd_sim_count)/sqrt(1000))
1751)
1752```
1753
1754```
1755## averageSimCount sdSimCount lowerConfidenceForMean upperConfidenceForMean
1756## low 312.8818 9.557677 312 313
1757## moderate 81.9786 8.321460 82 82
1758## high 32.9205 5.451160 33 33
1759## veryHigh 15.2191 3.780478 15 15
1760```
1761
1762Or this could be simulated by using the `sample()` function:
1763
1764
1765```r
1766frequency <- sample(1:4, 443, replace = TRUE, prob=all.wide$Australia) %>%
1767 table %>%
1768 tibble::enframe()
1769
1770names(frequency) <- c("Distress", "Count")
1771frequency$Distress <- names(iraqi)
1772frequency
1773```
1774
1775```
1776## # A tibble: 4 x 2
1777## Distress Count
1778## <chr> <table>
1779## 1 low 307
1780## 2 moderate 80
1781## 3 high 40
1782## 4 veryHigh 16
1783```
1784
1785### (06) Put Everything together
1786
1787So the idea is, in order to test the hypothesis that there is no difference we will set up a hypothesis test.
1788
1789The test statistic will be the total squared distance relative to the expected value for the distribution, this is known as the $\chi ^2$ value.
1790
1791If we can:
1792
1793* Assume the null hypothesis is true
1794 + (no difference between the number of iraqi counts and the number of `aihw` counts)
1795* With only 5% probability of getting a false positive under this assumption
1796
1797then we will reject the null hypothesis that they are the same and accept that there is a significant difference between the distribution of distress in the iraqi population.
1798 + (this necessarily show that the iraqi refugees are more distressed, merely that the distribution is different from the Australian distribution in such a way that would be unlikely to be a false positive assuming they were identical.)
1799
1800In order to evaluate this, We can take a random sample of values that are distributed with the same frequency as the `aihw` data and measure how often the corresponding $\chi^2$ value of the sampled values exceed that of the observed values for the iraqi values, under the assumption that the iraqi values are distributed at the same frequency of the `aihw`, this would amount to a False Positive for a difference.
1801
1802Creating many chi statistics, comparing them and repeating gives a false positive rate:
1803
1804
1805```r
1806# Calcualate the iraqi Chi Stat
1807e <- aihw/100 * sum(iraqi)
1808o <- iraqi
1809iraqi_chi <- sum((e-o)^2/e)
1810
1811# Simulate the False Positive Rate for data following the Aus Distribution
1812n <- 10^3 # Simulation Length
1813FalsePosVec <- vector(length = n)
1814for (i in 1:n) {
1815 # e <- rmultinom(1, 443, aihw/100)
1816 e <- sample(1:4, 443, replace = TRUE, prob=aihw) %>% table
1817 o <- aihw
1818# o <- iraqi
1819 sim_chi <- sum((e-o)^2/e )
1820 # Assume null hypotheses, this means we assume the iraqi distance does not exceed the sim distance
1821 # The number of times it does exceed is:
1822 # the probability of a false positive assuming the null hypothesis is correct
1823 # this is the p-value.
1824 FalsePosQ <- !(iraqi_chi < sim_chi) #(assuming null hypothesis means assuming iraqi distance leq to sim)
1825 # The number of positives will be the FPR and indicative of
1826 # the p-value)
1827 FalsePosVec[i] <- FalsePosQ
1828}
1829
1830pval <- mean(FalsePosVec)
1831print(pval)
1832```
1833
1834```
1835## [1] 1
1836```
1837
1838This could be made more efficient by using `replicate` rather than a `for` loop (`replicate` is to **_R_** as `Table[]` is to *Mathematica*):
1839
1840
1841```r
1842e <- aihw/100 * sum(iraqi)
1843o <- iraqi
1844iraqi_chi <- sum((e-o)^2/e)
1845
1846FalsePosCount <- replicate(10^4, expr = {
1847 o <- rmultinom(n = 1, size = sum(iraqi), prob = aihw)
1848 sim_chi <- sum((e-o)^2/e)
1849 !(iraqi_chi < sim_chi)
1850})
1851
1852mean(FalsePosCount)
1853```
1854
1855```
1856## [1] 1
1857```
1858
1859This returns a value of 1, indicating that the probability of a false positive, assuming that the data was randomly sampled following the probabilities of the proportions of the Australian populatation, is effectively 1, thus the null hypothesis should be rejected and the alternative hypothesis accepted.
1860
1861### (07) Use the InBuilt Chi-Square Statistic
1862
1863The chi-squared value that would correspond to a false positive rate like in the above simulation, may be determined by integrating the appropriate probability density function:
1864
1865$$
1866\begin{aligned}
1867f_n\left( x \right)= \frac{1}{2^{\frac{n}{2}} \cdot \Gamma\left( \frac{n}{2} \right)} \cdot x^{\frac{n}{2} - 1}\cdot e^{- \frac{x}{2}}
1868\end{aligned}
1869$$
1870
1871where the mean and variance are $n$ and $2n$ respectively; $\Gamma\left( x \right)$ is the gamma function it's very similar to the factorial function ($x!$):
1872
1873$$\begin{aligned}
1874x! &= \Gamma\left( x+ 1 \right)\\
1875x! &= x \cdot \Gamma\left( x \right)\\
1876\Gamma\left( n \right)&= \left( n- 1 \right)!, \qquad \forall n \in \mathbb{Z} \setminus \mathbb{Z}^- \\
1877\Gamma\left( z \right) &= \int_{0}^{\infty}\left( \left( \# \right)^{z- 1}\cdot e^{-\left( \# \right)} \right) \mathrm{d} \left( \# \right), \qquad z \in \mathbb{C} \wedge \Re\left( z \right)>0
1878\end{aligned}$$
1879
1880This doesn't seem quick to solve, plugging it into *Mathematica* gives:
1881
1882```wolfram
1883Integrate[ 1/2^(n/2*Gamma[n/2])*x^(n/2 - 1)*e^(-x/2), {x, 0, \[Infinity]}]
1884```
1885
1886```
1887ConditionalExpression[(2^(1/2 - (3 Sqrt[\[Pi]])/4) Sqrt[\[Pi]])/ Log[e]^(3/2), Re[Log[e]] > 0]
1888 ```
1889
1890 $$
1891\int^{\infty}_0\frac{1}{2^{\frac{n}{2}} \cdot \Gamma\left( \frac{n}{2} \right)} \cdot x^{\frac{n}{2} - 1}\cdot e^{- \frac{x}{2}} \enspace \mathrm{d}x = \frac{2^{\frac{1}{2}-\frac{3 \sqrt{\pi }}{4}} \sqrt{\pi }}{\log ^{\frac{3}{2}}(e)}
1892 $$
1893
1894Howerver inside **_R_** this is all built into the `pchisq()` function and the null hypothesis may be evaluated without necessarily undertaking the simulation.
1895
1896#### Evaluate Test Statistic Using Chi Statistic
1897
1898The probability of a false positive, assuming that the null hypothesis is true can be determined directly from the critical values of the Chi-Statistic.
1899
1900
1901
1902```r
1903# the null hypothesis is that there is no difference, the
1904# probability of detecting a difference will be the upper tail and would be the p-value
1905pchisq(q = iraqi_chi, df = (length(aihw)-1), lower.tail = FALSE)
1906```
1907
1908```
1909## [1] 0
1910```
1911
1912It isn't even necessary to calculate the $\chi^2$ value, this is built into **_R_** and can be done all at once:
1913
1914
1915```r
1916chisq.test(x = iraqi, p = aihw, rescale.p = TRUE)
1917```
1918
1919```
1920##
1921## Chi-squared test for given probabilities
1922##
1923## data: iraqi
1924## X-squared = 1550.6, df = 3, p-value < 2.2e-16
1925```
1926
1927As opposed to using the $\chi^2$ distribution, it is possible to use a *Monte Carlo* simulation all in one line as well:
1928
1929
1930```r
1931chisq.test(x = iraqi, p = aihw, rescale.p = TRUE, simulate.p.value = TRUE, B = 10^3)
1932```
1933
1934```
1935##
1936## Chi-squared test for given probabilities with simulated p-value (based on 1000 replicates)
1937##
1938## data: iraqi
1939## X-squared = 1550.6, df = NA, p-value = 0.000999
1940```
1941
1942I can't think of any reason to use the monte carlo simulation over the density distribution though
1943
1944
1945
1946
1947
1948
1949## Eels Data (Comparison of Two Populations)
1950
1951### (1) Enter the Data
1952
1953The data can be entered as a data frame or a matrix, the prior will be better for plotting and visualisation but the latter may be the expected format for various built in functions:
1954
1955
1956```r
1957# Create Vectors
1958g_moringa <- c("BoRdeR" = 264,"GraSs" = 127,"SaNd" = 99)
1959g_vicinus <- c("BoRder" = 161,"GraSs" = 116,"SaNd" = 67)
1960 # Capitals to emphasise change later with dimnames
1961
1962# Create a Matrix
1963eel_Mat <- rbind(g_moringa, g_vicinus)
1964dimnames(eel_Mat) <- list(species = c("G.moringa", "G.vicinus"),
1965 location = c("Border", "Grass", "Sand")
1966 )
1967
1968
1969# Create a Data Frame
1970eel_DF <- eel_Mat %>%
1971 as_tibble() %>%
1972 add_column("Species" = factor(rownames(eel_Mat))) %>%
1973 dplyr::select(Species, Border, Grass, Sand)
1974
1975eel_DF # %>% kable()
1976```
1977
1978```
1979## # A tibble: 2 x 4
1980## Species Border Grass Sand
1981## <fct> <dbl> <dbl> <dbl>
1982## 1 G.moringa 264 127 99
1983## 2 G.vicinus 161 116 67
1984```
1985
1986#### Plot the Data
1987
1988In order to plot the data a tidy data frame needs to be made using `tidyr::pivot_longer()` or `reshape2::melt()`.
1989A custom colour pallet can be specified using the following layers [^ggplotColLayer]:
1990
1991[^ggplotColLayer]: [ggplot2 colors : How to change colors automatically and manually? - Easy Guides - Wiki - STHDA](http://www.sthda.com/english/wiki/ggplot2-colors-how-to-change-colors-automatically-and-manually)
1992
1993
1994* Discrete Data
1995 + `scale_fill_manual()`
1996 + to change the fill of the object
1997 + `scale_color_manual()`
1998 + to change the colour of the **outline** of the object
1999 + Using Built in Palletes:
2000 + `RColorBrewer`
2001 + `scale_fill_brewer(palette="Dark2")`
2002 + `scale_color_brewer(palette="Dark2")`
2003 + `wesanderson`
2004 + `scale_fill_manual(values=wes_palette(n=3, name="GrandBudapest"))`
2005 + `scale_color_(values=wes_palette(n=3, name="GrandBudapest"))`
2006* Continuous Data
2007 + `scale_color_gradient(low="blue", high="red")`
2008 + Color is the outline, so think Scatter Plot
2009 + To Have a Diverging Pallet:
2010 + `scale_color_gradient2`(midpoint=mid, low="blue", mid="white", high="red", space ="Lab" )
2011 + To have a pallet of n different colours (in this example 7:
2012 + `scale_color_gradientn(colours = rainbow(7))`
2013 + `scale_fill_gradient()`
2014 + Fill is the filling so think Histogram
2015
2016scale_color_gradient(low="blue", high="red")
2017
2018
2019
2020```r
2021# Create a Tidy Data Frame
2022
2023## Using Pivot Longer from `tidyverse` (dev git repo)
2024eel_DF_Tidy <- pivot_longer(data = eel_DF,
2025 cols = names(eel_DF[,-1]),
2026 names_to = "Location",
2027 values_to = "Count")
2028
2029eel_DF_Tidy$Species <- factor(eel_DF$Species)
2030eel_DF_Tidy$Location <- factor(eel_DF_Tidy$Location)
2031
2032## Using Melt from `reshape2`
2033melt(eel_DF, ) %>% dplyr::rename("Location" = variable,
2034 "Count" = value)
2035```
2036
2037```
2038## Using Species as id variables
2039```
2040
2041```
2042## Species Location Count
2043## 1 G.moringa Border 264
2044## 2 G.vicinus Border 161
2045## 3 G.moringa Grass 127
2046## 4 G.vicinus Grass 116
2047## 5 G.moringa Sand 99
2048## 6 G.vicinus Sand 67
2049```
2050
2051```r
2052 # Instead of using dplyr I could have used `variable.name=`...,
2053 # just done for reference
2054
2055## ggplot2
2056violetBluePallet <- c("#511FB5", "dodgerblue3", "#e31a1c")
2057ggplot(data = eel_DF_Tidy,
2058 mapping = aes(x = Location,
2059 y = Count,
2060 fill = Species,
2061 col = Count)) +
2062 geom_col(position = "dodge") +
2063 guides(col = FALSE) +
2064 theme_classic() +
2065 labs(title = "blah", subtitle = "", y = "Frequency") +
2066 scale_fill_manual(values=violetBluePallet)
2067```
2068
2069
2070
2071
2072### (2) Expected Values
2073
2074In this case the two hypothesis are:
2075
2076* $H_0: \quad$ the two species are distriuted with the same frequency
2077 + So the proportion in both areas is assumed to be equal
2078 + Given this assumption we may take the proportion of the total number
2079 of observations that occur in this area and this will be equal to averaging the proportions
2080 of each species that occur in the two areas.
2081* $H_a: \quad$ there is a difference between the distribution of the two frequencies
2082
2083
2084
2085Under the assumption that both species have the same distribution (i.e. assume $\mathrm{H}_0$ is true) each term $x_{ij}$ will have an expected frequency of $f = \frac{1}{n} \cdot \sum^{2}_{i= 1} \left[x_i \right]$ and hence the expected count would be the frequency multiplied by the total number of observed species:
2086
2087$$\begin{aligned}
2088 x_{ij}&= f \cdot \sum^{3}_{j= 1} \left[ x_{ij} \right] \\
2089&= \frac{1}{n} \cdot \sum^{2}_{i= 1} \left[x_i \right] \cdot \sum^{3}_{j= 1} \left[ x_{ij} \right]
2090\end{aligned}$$
2091
2092so the resulting matrix of counts would be:
2093
2094$$\begin{aligned}
2095 \begin{bmatrix} 490 \\ 344 \end{bmatrix}
2096 \times
2097 \begin{bmatrix} 0.5 & 0.29 & 0.2 \end{bmatrix} \\
2098 = \begin{bmatrix} 250 & 143 & 98 \\
2099 175 & 100 & 68 \end{bmatrix}
2100\end{aligned}$$
2101
2102
2103Assuming that the null hypothesis is true, the expected distribution between areas could be calculated by using matrix multiplcication:
2104
2105
2106```r
2107species_counts <- rowSums(eel_Mat)
2108location_proportions <- colSums(eel_Mat)/sum(eel_Mat)
2109
2110# Now Perform matrix Multiplication
2111 species_counts
2112```
2113
2114```
2115## G.moringa G.vicinus
2116## 490 344
2117```
2118
2119```r
2120 location_proportions
2121```
2122
2123```
2124## Border Grass Sand
2125## 0.5095923 0.2913669 0.1990408
2126```
2127
2128```r
2129 as.matrix(species_counts) %*% t(as.matrix(location_proportions))
2130```
2131
2132```
2133## Border Grass Sand
2134## G.moringa 249.7002 142.7698 97.52998
2135## G.vicinus 175.2998 100.2302 68.47002
2136```
2137
2138This is actually the definition of the outer product; the [*Outer Product*](https://en.wikipedia.org/wiki/Outer_product#Definition) is defined as:
2139
2140
2141
2142$$
2143\mathbf{u} \otimes \mathbf {v} =\mathbf {u} \mathbf {v} ^{\textsf {T}}={\begin{bmatrix}u_{1}\\u_{2}\\u_{3}\\u_{4}\end{bmatrix}}{\begin{bmatrix}v_{1}&v_{2}&v_{3}\end{bmatrix}}={\begin{bmatrix}u_{1}v_{1}&u_{1}v_{2}&u_{1}v_{3}\\u_{2}v_{1}&u_{2}v_{2}&u_{2}v_{3}\\u_{3}v_{1}&u_{3}v_{2}&u_{3}v_{3}\\u_{4}v_{1}&u_{4}v_{2}&u_{4}v_{3}\end{bmatrix}}.
2144$$
2145In **_R_** vectors [^vecspace] of length $m$ are treated as $m \times 1$ matrices as can be observed by evaluating `as.matrix(1:3)` and `t(as.matrix(1:3))`, this means that the outer product of two vectors will be equivalent to:
2146
2147
2148$$
2149\mathbf {u} \otimes \mathbf {v} =\mathbf {A} ={\begin{bmatrix}u_{1}v_{1}&u_{1}v_{2}&\dots &u_{1}v_{n}\\u_{2}v_{1}&u_{2}v_{2}&\dots &u_{2}v_{n}\\\vdots &\vdots &\ddots &\vdots \\u_{m}v_{1}&u_{m}v_{2}&\dots &u_{m}v_{n}\end{bmatrix}}
2150$$
2151
2152
2153
2154
2155[^vecspace]: In this context a matrix is vector in the sense that matrices can be used to satisfy the axiom of vector addition for a Vector Space, refer to 4.2 of R Larson's *Linear Algebra* 7th ed.
2156
2157
2158
2159So the expected occurence rate of the species would be:
2160
2161
2162```r
2163# Determine how many species there are
2164species_counts <- rowSums(eel_Mat)
2165species_proportions <- rowSums(eel_Mat)/n
2166# Determine the area proportions
2167location_proportions <- colSums(eel_Mat)/sum(eel_Mat)
2168
2169# Calculate the expected distribution of that number for those proportions
2170expected_counts <- base::outer(species_counts, location_proportions)
2171
2172print(list(species_counts, location_proportions, expected_counts, base::outer(location_proportions, species_counts) ))
2173```
2174
2175```
2176## [[1]]
2177## G.moringa G.vicinus
2178## 490 344
2179##
2180## [[2]]
2181## Border Grass Sand
2182## 0.5095923 0.2913669 0.1990408
2183##
2184## [[3]]
2185## Border Grass Sand
2186## G.moringa 249.7002 142.7698 97.52998
2187## G.vicinus 175.2998 100.2302 68.47002
2188##
2189## [[4]]
2190## G.moringa G.vicinus
2191## Border 249.70024 175.29976
2192## Grass 142.76978 100.23022
2193## Sand 97.52998 68.47002
2194```
2195
2196or if you're willing to remember that:
2197
2198$$
2199e_{ij} = \frac{\sum{[\textsf{row}]} * \sum{[\textsf{col}]}}{n}
2200$$
2201
2202
2203```r
2204e <- matrix(1:6, nrow = 2)
2205 for (i in 1:nrow(eel_Mat)) {
2206 for (j in 1:ncol(eel_Mat)) {
2207 e[i,j] <- colSums(eel_Mat)[j] * rowSums(eel_Mat)[i] / n
2208 }
2209 }
2210```
2211
2212
2213
2214### (3) Simulate The Values
2215
2216The game plan here is to:
2217
22181. Assume that the null hypothesis is true:
2219 + The observations are distributed equally across features:
2220 + $e_{ij} = \sum \left[ \textsf{rows} \times \right] \sum \left[ \textsf{cols} \right] \times \frac{1}{n}$
22212. Randomly sample values at the same probability
2222 + given this simulated observation, determine what the expected
2223 distribution would be determined to be assuming that the null
2224 hypothesis was true.
2225 + Determine what the corresponding $\chi^2$ value is.
2226 + the number of times that this simulated $\chi^2$ value is
2227 greater than the observed $\chi^2$ value is
2228 the _**F**alse **P**ositive **R**ate_
2229
2230#### Sample a Single Value
2231
2232In order to simulate the values we need simulate data distributed at given probabilities, this is known as a multinomial distribution, it's essentially rolling a really oddly lopsided die that matches the probabilities specified.
2233
2234From that sample it is necessary to calculate what we would determine the expected distribution to be assuming that the null hypothesis was true:
2235
2236
2237```r
2238overall_proportion <- expected_counts/sum(expected_counts)
2239
2240# Presuming that R's internal structure is consistent
2241rmultinom(n = 1, size = sum(eel_Mat), prob = overall_proportion) %>% matrix(ncol = 3, nrow = 2)
2242```
2243
2244```
2245## [,1] [,2] [,3]
2246## [1,] 229 153 96
2247## [2,] 184 102 70
2248```
2249
2250```r
2251# Presuming it's not
2252dist_prob <- overall_proportion %>% as.vector
2253obs_sim <- rmultinom(n = 1, size = sum(eel_Mat), prob = dist_prob) %>% matrix(ncol = 3, nrow = 2)
2254e_sim <- 1/n*outer(X = rowSums(obs_sim), colSums(obs_sim))
2255
2256print(list(obs_sim, e_sim), 1)
2257```
2258
2259```
2260## [[1]]
2261## [,1] [,2] [,3]
2262## [1,] 228 144 107
2263## [2,] 193 99 63
2264##
2265## [[2]]
2266## [,1] [,2] [,3]
2267## [1,] 202 116 81
2268## [2,] 149 86 60
2269```
2270
2271```r
2272## Sanity Check
2273#1:6 %>% matrix(nrow = 2, ncol =3) %>% as.vector()
2274#1:6 %>% as.matrix() %>% as.vector() %>% matrix(nrow = 2, ncol =3)
2275#1:6 %>% as.matrix() %>% as.vector() %>% matrix(nrow = 2, ncol =3) %>% as.vector()
2276#1:6 %>% as.matrix() %>% as.vector() %>% matrix(nrow = 2, ncol =3) %>% as.vector() %>% matrix(nrow = 2, ncol =3)
2277```
2278
2279If this was repeated many times over, the number of times that the $\chi^2$ statistic was sufficiently extreme to reject the null hypothesis would represent the false positive rate, which would be an acceptable estimate for the probability of a type I error, the $p$-value:
2280
2281> The probability of rejecting the null hypothesis under the assumption that it is true (i.e. under the assumption that there is no true effect). Careful, this is different from the false discovery rate
2282
2283This simulation is under the assumption that the null hypothesis is true and that the two populations are distributed equally, so the null hypothesis assumes that:
2284
2285$$
2286\begin{aligned}
2287&\mathrm{H}_0:\quad \chi^2_{obs} < \chi^2_{sim} \\
2288\end{aligned}
2289$$
2290
2291A False Positive would be an observation that violates that assumption, if the probability of a false positive, the $p$ -value:
2292
2293* Sufficiently small, the null hypothesis will be rejected.
2294* too high the null hypothesis will not be rejected.
2295
2296#### Simulate Samples of that frequency
2297
2298Calculate the Chi Distribution for the observations which will become the test statistic:
2299
2300
2301```r
2302# Create expected and observed vectors
2303e <- expected_counts
2304o <- eel_Mat
2305n <- sum(eel_Mat)
2306
2307chi_obs <- sum((e-o)^2/e)
2308
2309#obs_sim <- rmultinom(n = 1, size = sum(eel_Mat), prob = dist_prob) %>% matrix(ncol = 3, nrow = 2)
2310```
2311
2312
2313The idea of the simulation is to generate observations at the proportion assumed by the null hypothesis and reduce these matrices to corresponding $\chi^2$ values.
2314
2315
2316Simulate samples at the same proportion and sample the $\chi^2$ statistic:
2317
2318
2319```r
2320# Simulate distribution
2321 # Use `Replicate` not `for` because it's faster
2322
2323dist_prob <- overall_proportion %>% as.vector
2324 # I could also have used rmultinom to sample the split of the population across two species
2325 # Then split those species amont locations
2326 # or made two samples of the species and location dist and then used `table()`
2327
2328sim_chi_vec <- replicate(10^4, {
2329
2330 ## Simulate Samples
2331obs_sim <- rmultinom(n = 1, size = sum(eel_Mat), prob = dist_prob) %>% matrix(ncol = 3, nrow = 2)
2332
2333 ## Calculate the expected values from the sample
2334 ## Assuming the null hypothesis that both rows are equal.
2335e <- outer(rowSums(obs_sim), colSums(obs_sim))/n
2336
2337 ## Calculate the Chi Squared Statistic
2338sim_chi <- sum((e-obs_sim)^2/e)
2339sim_chi
2340
2341})
2342```
2343
2344If a simulated distribution had a $\chi^2$ value more extreme than the observation, the null hypothesis would be rejected, the simulation was generated under circumstanes where the null hypothesis was true and so this would be a false positive or a *Type I Error*.
2345
2346The rate of false positives is an estimator for the probability of commiting a Type I error (the $p$ -value), this can be calculated:
2347
2348
2349
2350```r
2351calculate_p_value <- function() {
2352 mean(falsepos())
2353}
2354
2355 falsepos <- function() {
2356 # If the critical value was our observation, would the simulation be less extreme?
2357 # If not this is a false positive
2358 # If the simulated value is more extreme than the
2359 !(sim_chi_vec > chi_obs)
2360 sim_chi_vec > chi_obs
2361 }
2362
2363p <- calculate_p_value()
2364
2365paste("The Probability of rejecting the null hypothesis, assuming that it was true (i.e. detecting a false positive assuming there is no difference between species) is", p) %>% print()
2366```
2367
2368```
2369## [1] "The Probability of rejecting the null hypothesis, assuming that it was true (i.e. detecting a false positive assuming there is no difference between species) is 0.0403"
2370```
2371
2372```r
2373"This is too high and so the null hypothesis is not rejected." %>% print()
2374```
2375
2376```
2377## [1] "This is too high and so the null hypothesis is not rejected."
2378```
2379
2380Hence the probability of rejecting the null hypothesis when it is true is quite small, the $p$ -value is less than 5% and so the null hypothesis is rejected and the alternative hypothesis, that the species are distributed differently is accepted.
2381
2382This can be visualised in a histogram:
2383
2384
2385
2386```r
2387# Plot a Histogram
2388
2389 ## Base Plot
2390 myhist <- hist(sim_chi_vec, breaks = 50)
2391```
2392
2393
2394
2395```r
2396 plot(myhist, col = ifelse(myhist$mid<6, "white", "lightblue"),
2397 main= latex2exp::TeX("Simulated $\\chi^2$ Distances"),
2398 freq = FALSE,
2399 xlab = TeX("$\\chi^2$ distance"))
2400 abline(v = chi_obs, col = "Indianred", lty = 2, lwd = 3)
2401```
2402
2403
2404
2405```r
2406 # Conditional Colour: # https://stat.ethz.ch/pipermail/r-help/2008-July/167936.html
2407
2408 ## GGPlot2
2409
2410 ### Make a Tidy Data Frame
2411 chi_vals_Tib <- tibble::enframe(sim_chi_vec)
2412 chi_vals_Tib$name[sim_chi_vec<chi_obs] <- "TruePos"
2413 chi_vals_Tib$name[sim_chi_vec>chi_obs] <- "FalsePos"
2414
2415 ### Plot the Data
2416 ggplot(data = chi_vals_Tib, mapping = aes(x = value))+
2417 geom_histogram(binwidth = 0.4, col = "white", aes(fill = name)) +
2418 theme_classic() +
2419 guides(fill = guide_legend("Conclustion\nStatus")) +
2420 geom_vline(xintercept = chi_obs, lty = 3, col = "purple") +
2421 scale_fill_manual(values = c("indianred", "lightblue"),
2422 labels = c("False Positive / Type I", "True Positive")
2423 ) +
2424 labs(title = latex2exp::TeX("Simulated $\\chi^2$ Values")) +
2425 scale_x_continuous(limits = c(0, 11.5))
2426```
2427
2428```
2429## Warning: Removed 32 rows containing non-finite values (stat_bin).
2430```
2431
2432```
2433## Warning: Removed 4 rows containing missing values (geom_bar).
2434```
2435
2436
2437
2438
2439
2440### (4) Use the Chi Squared Distribution
2441
2442The $p$-value is a function of:
2443
2444* The degrees of freedom
2445 + $\mathsf{d.f. = (\mathsf{rows}-1) \times (\mathsf{cols}-1)}$
2446* The $\chi^2$ value
2447
2448So the probability of rejecting the null hypothesis, under the assumption that it is true can be determined using a predefined function.
2449
2450#### Monte Carlo Simulation
2451
2452
2453```r
2454chisq.test(x = eel_Mat, simulate.p.value = TRUE, B = 10^4)
2455```
2456
2457```
2458##
2459## Pearson's Chi-squared test with simulated p-value (based on 10000 replicates)
2460##
2461## data: eel_Mat
2462## X-squared = 6.2621, df = NA, p-value = 0.0459
2463```
2464
2465#### Analytic Function
2466
2467```r
2468chisq.test(eel_Mat)
2469```
2470
2471```
2472##
2473## Pearson's Chi-squared test
2474##
2475## data: eel_Mat
2476## X-squared = 6.2621, df = 2, p-value = 0.04367
2477```
2478
2479### Summary
2480
2481To determine whether or not the two species are distributed differently:
2482
2483
2484```r
2485matrix(c(264, 161, 127, 116, 99, 67), nrow = 2) %>% chisq.test()
2486```
2487
2488```
2489##
2490## Pearson's Chi-squared test
2491##
2492## data: .
2493## X-squared = 6.2621, df = 2, p-value = 0.04367
2494```
2495
2496
2497## Other Examples
2498
2499### (1) Daily Frequency
2500
2501ABC bank has determined the following counts of ATM use, is there any evidence to suggest that the spread is not equal?
2502
2503| Mon | Tues | Wed | Thur | Fri |
2504| --- | ---| --- | --- | --- |
2505| 253 | 197 | 204 | 279 | 267 |
2506
2507#### Hypothesis
2508
25091. $H_0:\quad x_1 = x_2 = ... x_5$
25102. $H_a:\quad x_i \neq x_j \enspace \exists i,j \in \left\{1,2,3,4,5\right\}$
2511
2512#### Rejection Region
2513
2514
2515```r
2516atm_usage <- c(253, 197, 204, 279, 267)
2517chisq.test(x = atm_usage)
2518```
2519
2520```
2521##
2522## Chi-squared test for given probabilities
2523##
2524## data: atm_usage
2525## X-squared = 23.183, df = 4, p-value = 0.0001164
2526```
2527
2528```r
2529chisq.test(x = atm_usage, simulate.p.value = TRUE, B = 10^3)
2530```
2531
2532```
2533##
2534## Chi-squared test for given probabilities with simulated p-value (based on 1000 replicates)
2535##
2536## data: atm_usage
2537## X-squared = 23.183, df = NA, p-value = 0.000999
2538```
2539
2540The $p$ -value is less than 5% and so the null hypothesis is rejected and it is concluded that the usage differs between days of usage.
2541
2542
2543### (2) Mendellian Genetics
2544
2545Clasp you hands together. Which thumb is on top? Everyone has two copies of a gene which
2546determines which thumb is most comfortable on top, the variants can be labelled L and R, an
2547individual is either LL, LR, or RR. Counts of 65 children found,
2548
2549| LL | LR | RR |
2550|----|----|----|
2551| 14 | 31 | 20 |
2552
2553According to Mendelian genetics 25% should be LL, 50% LR and 25% RR.
2554Use chisq.test to see if the data is consistent with this hypothesis.
2555
2556#### Hypotheses
2557
25581. $H_0:\quad \mathsf{LL}=\mathsf{RR}=0.25; \enspace \mathsf{LR}=0.5$
25592. $H_a:\quad$ The distribution differs from 0.25, 0.25, 0.5.
2560
2561#### Rejection Region
2562
2563In order to deal with this type of hypothesis, it is necessary to pass the probabilities to the function as a seperate argument:
2564
2565
2566```r
2567gene_hand <- c(14, 31, 20)
2568chisq.test(gene_hand, p = c(0.25, 0.5, 0.25))
2569```
2570
2571```
2572##
2573## Chi-squared test for given probabilities
2574##
2575## data: gene_hand
2576## X-squared = 1.2462, df = 2, p-value = 0.5363
2577```
2578
2579The $p$ -value is 50%, this indicates that the probability of rejecting the null hypothesis, if it was true and hence commiting a Type I error is quite high.
2580
2581Hence the null hypothesis is not rejected and it is concluded that:
2582
2583* There is insufficient evidence to reject the hypothesis that the hand usage are distributed in a way consistent with *Mendellian* genetics.
2584
2585
2586### (3) Political Opinion
2587300 adults were asked whether school teachers should be given more freedom to punish unruly
2588students. The following results were obtained?
2589
2590| | In Favour | Against | No Opinion |
2591| --- | --- | --- | --- |
2592| Men | 93 | 70 | 12 |
2593| Women | 87 | 32 | 6 |
2594
2595
2596Do men and women have the same distribution of opinions?
2597
2598#### Rejection Region
2599
2600By default a Chi-Squared test in **_R_** will compare whether or not rows (observations) in a matrix have the same distribution.
2601
2602This is distinct from Question 2 above where the question was:
2603
2604* Is the observation distributed at the speciefied frequency
2605
2606In this case the question is:
2607
2608* Are the two cases distributed with the same frequency
2609 * Assuming that they are means that the true distribution will be the mean value of the two observations.
2610
2611
2612```r
2613opinion_punish <- matrix(c(93, 87, 70, 32, 12, 6), nrow = 2)
2614
2615chisq.test(opinion_punish)
2616```
2617
2618```
2619##
2620## Pearson's Chi-squared test
2621##
2622## data: opinion_punish
2623## X-squared = 8.2528, df = 2, p-value = 0.01614
2624```
2625
2626The $p$ -value is 2% indicating the probability of rejecting the null-hypothesis if it were true is quite low, hence the null hypothesis is rejected and it is concluded that men and women have differing distributions of opinions.
2627
2628### (4) Medical Efficacy
2629
2630Two drugs are administered to patients to treat the same disease.
2631
2632| | Cured | Not Cured |
2633| --- | --- | --- |
2634| Drug A | 44 | 16 |
2635| Drug B | 18 | 22 |
2636
2637Are the drugs equally effective?
2638
2639#### Rejection Region
2640
2641This is the same as the eels or men/women problem, the drugs are rows / observations of different classes and the columns will be the outcome of the treatment:
2642
2643
2644
2645```r
2646drug_effect <- matrix(c(44, 18, 16, 22), nrow = 2)
2647rownames(drug_effect) <- c("Drug A", "Drug B")
2648colnames(drug_effect) <- c("Cured", "Not Cured")
2649
2650chisq.test(drug_effect)
2651```
2652
2653```
2654##
2655## Pearson's Chi-squared test with Yates' continuity correction
2656##
2657## data: drug_effect
2658## X-squared = 7.0193, df = 1, p-value = 0.008064
2659```
2660
2661The p-value is <1% indicating that the probability of rejecting the null hypothesis, under the assumption that it is true, is very small.
2662
2663Hence the null hypothesis is rejected and it is concluded that the drugs differ in effectiveness.