class: center, inverse
background-image: url("main2.png")
background-size: contain

---

<style type="text/css">
body, td {
  font-size: 15px;
}
code.r{
  font-size: 15px;
}
pre {
  font-size: 20px
}
.huge .remark-code { /*Change made here*/
  font-size: 200% !important;
}
.tiny .remark-code { /*Change made here*/
  font-size: 80% !important;
}
</style>

## Press record

---

## Corrections from the previous lecture

1. Intercept in the model with Age <br/> <br/>

2. Anova comparison

---

## First meeting exercises (Friday)

Did you manage to do any of those?

---

## R code for this lecture

[Link](https://nvaci.github.io/Lecture_2_code/Lecture2_Rcode)

---

## What we know

.pull-left[
Gaussian distribution: `\(y_{i} \sim \mathcal{N}(\mu_{i},\sigma)\)` <br/> <br/>
`\(\mu_{i} = \alpha+\beta*x_i\)` <br/><br/>
Model the mean and the standard deviation <br/> <br/> <br/> <br/>

a) Errors: `\(\mathcal{N}^{iid}(0,\sigma^2)\)` <br/> <br/>
b) Linearity and additivity <br/> <br/>
c) Validity <br/> <br/>
d) Lack of perfect multicollinearity <br/> <br/>
]

.pull-right[
<img src="gauss.png" width="120%"> <br/>
Google Doodle: Johann Carl Friedrich Gauß’s 241st Birthday
]

---

## What are these data?

.center[
<img src="image.png" width="50%">
]

---

## Probability distributions

The exponential family of distributions: <br/>

.pull-left[
Continuous outcomes: <br/><br/>
Categorical outcomes (Yes/No or multiple categories): <br/><br/><br/>
Counted outcomes (successes, counts of multiple outcomes, occurrences in continuous domains): <br/><br/> <br/> <br/> <br/>
Positive real outcomes (always positive):
]

.pull-right[
Normal (Gaussian) distribution<br/><br/>
Bernoulli distribution, Categorical distribution<br/><br/> <br/>
Binomial distribution, Multinomial distribution, Poisson distribution <br/><br/><br/><br/>
Exponential, Gamma or Inverse Gaussian distribution <br/><br/>
]

---

## Exponential family

.center[
<img src="Exp1.png" width="90%"> <br/>
]

---

## Exponential family

.center[
<img src="Exp2.png" width="90%"> <br/>
]

---

## Exponential family

.center[
<img src="Exp3.png" width="90%"> <br/>
]

---

## Exponential family

.center[
<img src="Exp4.png" width="90%"> <br/>
]

---

## Exponential family

.center[
<img src="Exp5.png" width="90%"> <br/>
]

???
We can use all of these distributions as outcomes in generalized linear models<br/>
Poisson - counts with no upper bound <br/>
Gamma - always positive, time to an event with multiple components (e.g., age of onset of cancer) <br/><br/>
Many of these distributions converge to the Normal: the Binomial if the probability is far from the edges, the Poisson and Gamma if the mean is large - remember the central limit theorem<br/><br/>
https://en.wikipedia.org/wiki/Exponential_family<br/><br/>
https://i.stack.imgur.com/HgpO4.jpg <br/><br/>
Check out the amazing lectures by Richard McElreath (this lecture draws on his ideas extensively): [youtube channel](https://www.youtube.com/channel/UCNJK6_DZvcMqNSzQdEkzvzA)

---

## Binomial distribution

A coin is flipped 100 times. What is the probability that it lands heads up exactly 75 times?
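Before deriving the answer, a quick numerical check in R - a sketch assuming a fair coin, so `p = 0.5` (`dbinom` returns the Binomial probability mass):

```r
# P(exactly 75 heads in 100 flips of a fair coin)
dbinom(75, size = 100, prob = 0.5)   # ~1.9e-07: very unlikely
```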
<br/><br/>

The number of successes in n independent Bernoulli trials follows a Binomial distribution:<br/><br/> <br/>

`\(P(X=x)=\binom{n}{x}p^x(1-p)^{n-x}\)` <br/><br/>

`\(\mu=n*p\)` <br/><br/>

`\(\sigma^2=n*p*(1-p)\)`

---

## Babies

```
## 
##  0  1 
## 37 63
```

```r
par(mfrow=c(1,1), bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16)
plot(1:100, dbinom(1:100, size=100, prob=0.63), xlab='N of babies crawling', ylab='Probability', type='l')
```

<img src="GLMs_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto 0 auto auto;" />

---

## Babies

`\(\mu=100*0.63=63\)` <br/><br/>
`\(\sigma=\sqrt{100*0.63*(1-0.63)}=4.83\)`

Exactly 75:

```r
x=75
choose(100, x)*0.63^x*(1-0.63)^(100-x)
```

```
## [1] 0.003470009
```

At least 64 (more than the expected 63):

```r
x=64:100
sum(choose(100, x)*0.63^x*(1-0.63)^(100-x))
```

```
## [1] 0.4623385
```

---

## Logistic regression

Distribution: <br/>

`$$y_i \sim Binomial(n,p)$$`

`$$p=\alpha+\beta*x_i$$`

--

Map values to [0,1]: <br/>

`$$f(p) =log(\frac{p}{1-p})$$`

`$$logit(p)=\alpha+\beta*x_i$$`

???
Our outcome y is a count that follows a Binomial distribution.<br/>
What we would like to estimate/model is the probability (p).<br/>
What would happen if we modelled the probability using linear regression?<br/><br/>
We would run into these issues:<br/>
Errors are not necessarily normal <br/>
Mean and SD are related <br/>
A linear function is unbounded - it does not have to result in a number between 0 and 1 <br/>

---

## First step

`$$Odds =\frac{p_{i}}{1-p_{i}}$$`

<img src="GLMs_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Second step

`$$logOdds =log(\frac{p_{i}}{1-p_{i}})$$`

<img src="GLMs_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## What is the model doing?

<img src="image3.png" width="100%">

---

## What is the model doing?

<img src="image2a.png" width="100%">

---

## Relation between LogOdds and Probabilities

.center[
<img src="image4.png" width="50%">
]

---

## Generalized linear models

.center[
<img src="links.png" width="80%">
]

---

## Logistic regression in R

We return to our crying babies that are motivated to start crawling.

```r
par(mfrow=c(1,3), bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16)
plot(Babies$Age, Babies$Height, xlab='Age (months)', ylab='Height (cm)')
boxplot(Babies$Height~Babies$Gender,xlab='Gender', ylab='Height (cm)')
boxplot(Babies$Age~Babies$Crawl,xlab='Crawl', ylab='Age (months)')
```

<img src="GLMs_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

## Logistic regression in R: coefficients

Predict crawling success as a linear function of age

```r
glm1<-glm(Crawl~Age, data=Babies, family=binomial(link='logit'))
glm1$coefficients
```

```
## (Intercept)         Age 
##  -1.3307752   0.1194777
```

<br/><br/>
Three ways to interpret logistic regression:<br/><br/>
1. Logit values<br/>
2. Odds ratios<br/>
3. Probabilities<br/>

---

## Odds ratios

```r
glm1$coefficients
```

```
## (Intercept)         Age 
##  -1.3307752   0.1194777
```

```r
exp(glm1$coefficients)
```

```
## (Intercept)         Age 
##   0.2642723   1.1269081
```

Intercept: When Age is zero (at birth), the odds of starting to crawl are 0.26 - that is, 74% (1 - 0.26) lower than the odds of not crawling.

Age: For each additional month of age, the odds of starting to crawl are multiplied by 1.13 - an increase of about 13%.
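Because the model is linear on the log-odds scale, a one-month difference in Age multiplies the odds by the same factor at any age. A small sketch using the fitted `glm1` from above:

```r
# Odds of crawling at 10 and 11 months, from the fitted coefficients
odds10 <- exp(coef(glm1)[[1]] + coef(glm1)[[2]] * 10)
odds11 <- exp(coef(glm1)[[1]] + coef(glm1)[[2]] * 11)
odds11 / odds10   # equals exp(coef(glm1)[[2]]), i.e. ~1.13, at any age
```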
---

## Probabilities

`\(p=\frac{1}{1+e^{-(\alpha+\beta*x)}}\)`<br/><br/>

```r
1/(1+exp(1.33078))
```

```
## [1] 0.2090304
```

Intercept: the estimated probability of babies starting to crawl when Age is 0 (at birth)<br/><br/>

Age - we need to decide where on the curve to evaluate the change:<br/>

```r
1/(1+exp(1.33078-0.11948*10))
```

```
## [1] 0.4660573
```

```r
arm::invlogit(coef(glm1)[[1]]+coef(glm1)[[2]]*mean(Babies$Age))
```

```
## [1] 0.6562395
```

---

## Interpretation continued

```r
Babies$LogOdds=-1.33078+0.11948*Babies$Age
Babies$Odds=exp(Babies$LogOdds)
Babies$Probs=Babies$Odds/(1+Babies$Odds)
par(mfrow=c(1,3), bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16)
plot(Babies$Age,Babies$LogOdds)
plot(Babies$Age, Babies$Odds)
plot(Babies$Age,Babies$Probs)
```

<img src="GLMs_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

## Predicted probability versus outcome

```r
ggplot(data=Babies, aes(x=Age, y=Probs))+geom_point(size=2, col='blue')+geom_point(data=Babies, aes(x=Age, y=Crawl), size=2, col='red')
```

<img src="GLMs_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

## Model results

```r
summary(glm1)
```

```
## 
## Call:
## glm(formula = Crawl ~ Age, family = binomial(link = "logit"), 
##     data = Babies)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0687  -0.9876   0.5443   0.8979   1.6082  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.33078    0.50868  -2.616 0.008894 ** 
## Age          0.11948    0.03079   3.880 0.000104 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 131.79  on 99  degrees of freedom
## Residual deviance: 113.35  on 98  degrees of freedom
## AIC: 117.35
## 
## Number of Fisher Scoring iterations: 3
```

???
[Wald test](https://stats.stackexchange.com/a/60083)

---

## Logistic regression formula

What is missing in this equation?

`$$log(\frac{p}{1-p})=\alpha+\beta*x_i$$` <br/> <br/>

$$ y = \alpha + \beta * Age + \epsilon $$ <br/><br/>

--

We do not have an estimate of the error: `\(y_i \sim Binomial(n,p)\)`

We do have residuals - an estimate of the fit <br/>

`$$residual_i=y_i-logit^{-1}(\beta*x_i)$$`

---

## Deviance

Goodness of fit: deviance of the fitted model with respect to a perfect (saturated) model <br/><br/>

Null deviance: `\(D=-2*loglik(\beta_0)\)` <br/><br/>

Residual deviance: `\(D=-2*loglik(\beta*x)\)` <br/><br/>

```r
with(glm1, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
```

```
## [1] 1.751679e-05
```

AIC: `\(-2*loglik(\beta*x)+2*npar\)`<br/><br/>

BIC: `\(-2*loglik(\beta*x)+log(n)*npar\)`, where n is the number of observations
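These quantities can be checked directly in R - a small sketch using `glm1`: for a 0/1 outcome the saturated model has log-likelihood 0, so the residual deviance is simply `-2*logLik` of the fitted model.

```r
-2 * as.numeric(logLik(glm1))   # 113.35, the residual deviance
glm1$deviance                   # the same value, as reported by glm
AIC(glm1)                       # -2*logLik + 2*npar = 113.35 + 2*2 = 117.35
```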
???
Deviance: tests whether the model is more accurate than simply always guessing that the outcome will be the more common of the two categories

---

## Link functions and other GLMs

We can also use other link functions: <br/><br/>

- Probit: `\(p = \Phi(\beta*x_i)\)`
- Cauchit
- log and cloglog (complementary log-log) <br/><br/> <br/>

The most common generalized linear models:

- Poisson regression: traffic, goals in a soccer game <br/>
- Negative binomial: same problems, but deals with overdispersion <br/>
- Gamma regression: always positive, right-skewed data (reaction times?) <br/>

---

## Absolutely cool things to note

The question we are asking is: can we stratify the probability of success in the outcome using our predictor variables? <br/><br/>

We are modelling the mean and we do not have information on individual data points - there is no error term <br/><br/>

The scale of the parameters is different! <br/><br/>

Parameters are multiplicative - everything interacts, especially in the tails of the probability distribution (see the sketch below) <br/><br/>

It is not easy to compare the models - especially if they differ in more than one predictor - the deviance criterion tells you how well your model fits the mean <br/>
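A minimal sketch of that multiplicative point: the same one-unit shift on the logit scale moves the probability a lot around p = 0.5 and barely at all in the tails (the inverse logit is written out by hand here):

```r
invlogit <- function(x) 1 / (1 + exp(-x))
invlogit(0 + 1) - invlogit(0)   # ~0.23: a large change near p = 0.5
invlogit(3 + 1) - invlogit(3)   # ~0.03: a small change in the tail
```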
---

class: inverse, middle, center

# Practical aspect

---

## Data

NBA dataset: [Link](https://www.kaggle.com/ionaskel/nba-games-stats-from-2014-to-2018)

```r
basketball<-read.table('Basketball.txt',sep='\t', header=T)
knitr::kable(head(basketball[,c(5,13,18,31,34,43)]), format = 'html')
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:left;"> Home </th>
   <th style="text-align:right;"> X3PointShots </th>
   <th style="text-align:right;"> FreeThrows. </th>
   <th style="text-align:right;"> Opp.3PointShots. </th>
   <th style="text-align:right;"> Opp.FreeThrows. </th>
   <th style="text-align:right;"> Win </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:left;"> Away </td>
   <td style="text-align:right;"> 13 </td>
   <td style="text-align:right;"> 0.529 </td>
   <td style="text-align:right;"> 0.308 </td>
   <td style="text-align:right;"> 0.818 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 32 </td>
   <td style="text-align:left;"> Away </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 0.867 </td>
   <td style="text-align:right;"> 0.323 </td>
   <td style="text-align:right;"> 0.750 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 83 </td>
   <td style="text-align:left;"> Home </td>
   <td style="text-align:right;"> 8 </td>
   <td style="text-align:right;"> 0.652 </td>
   <td style="text-align:right;"> 0.368 </td>
   <td style="text-align:right;"> 0.769 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 112 </td>
   <td style="text-align:left;"> Home </td>
   <td style="text-align:right;"> 12 </td>
   <td style="text-align:right;"> 0.684 </td>
   <td style="text-align:right;"> 0.481 </td>
   <td style="text-align:right;"> 0.941 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 165 </td>
   <td style="text-align:left;"> Away </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 0.769 </td>
   <td style="text-align:right;"> 0.364 </td>
   <td style="text-align:right;"> 0.652 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 196 </td>
   <td style="text-align:left;"> Away </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 0.611 </td>
   <td style="text-align:right;"> 0.500 </td>
   <td style="text-align:right;"> 0.650 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
</tbody>
</table>

---

## Data

```r
table(basketball$Win)
```

```
## 
##  0  1 
## 69 81
```

<img src="GLMs_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

## Cross tabulation

```r
knitr::kable(table(basketball$Win, basketball$Home), format = 'html')
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> Away </th>
   <th style="text-align:right;"> Home </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 0 </td>
   <td style="text-align:right;"> 43 </td>
   <td style="text-align:right;"> 26 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:right;"> 28 </td>
   <td style="text-align:right;"> 53 </td>
  </tr>
</tbody>
</table>

```r
datA=aggregate(cbind(basketball$X3PointShots., basketball$Opp.3PointShots., basketball$FreeThrows., basketball$Opp.FreeThrows.), list(basketball$Win), mean)
names(datA)=c('Win','X3POints_mean','Opp3POints_mean','FreeThrows_mean','OppFreeThrows_mean')
knitr::kable(datA, format = 'html')
```

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Win </th>
   <th style="text-align:right;"> X3POints_mean </th>
   <th style="text-align:right;"> Opp3POints_mean </th>
   <th style="text-align:right;"> FreeThrows_mean </th>
   <th style="text-align:right;"> OppFreeThrows_mean </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0.3200580 </td>
   <td style="text-align:right;"> 0.3914928 </td>
   <td style="text-align:right;"> 0.7498696 </td>
   <td style="text-align:right;"> 0.7724928 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0.3840494 </td>
   <td style="text-align:right;"> 0.3252593 </td>
   <td style="text-align:right;"> 0.7752099 </td>
   <td style="text-align:right;"> 0.7541975 </td>
  </tr>
</tbody>
</table>

---

## Plot

```r
par(mfrow=c(1,2), bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16)
plot(basketball$X3PointShots.,basketball$Win)
plot(basketball$X3PointShots.,jitter(basketball$Win, 0.5))
```

<img src="GLMs_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

## Logistic Regression: one predictor

.tiny[
```r
baskWL1<-glm(Win~Home, data=basketball, family=binomial('logit'))
summary(baskWL1)
```

```
## 
## Call:
## glm(formula = Win ~ Home, family = binomial("logit"), data = basketball)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4909  -1.0015   0.8935   0.8935   1.3642  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.4290     0.2428  -1.767 0.077296 .  
## HomeHome      1.1412     0.3410   3.346 0.000819 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 206.98  on 149  degrees of freedom
## Residual deviance: 195.33  on 148  degrees of freedom
## AIC: 199.33
## 
## Number of Fisher Scoring iterations: 4
```
]

---

## Logistic Regression: two predictors

.tiny[
```r
baskWL2<-glm(Win~Home+X3PointShots., data=basketball, family=binomial('logit'))
summary(baskWL2)
```

```
## 
## Call:
## glm(formula = Win ~ Home + X3PointShots., family = binomial("logit"), 
##     data = basketball)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9950  -1.0185   0.5857   0.9993   1.8696  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -2.5847     0.6977  -3.705 0.000212 ***
## HomeHome        1.1151     0.3568   3.125 0.001779 ** 
## X3PointShots.   6.1575     1.8229   3.378 0.000731 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 206.98  on 149  degrees of freedom
## Residual deviance: 182.56  on 147  degrees of freedom
## AIC: 188.56
## 
## Number of Fisher Scoring iterations: 4
```
]

---

## Model comparison

```r
anova(baskWL1, baskWL2, test = "LR")
```

```
## Analysis of Deviance Table
## 
## Model 1: Win ~ Home
## Model 2: Win ~ Home + X3PointShots.
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       148     195.34                          
## 2       147     182.56  1   12.773 0.0003516 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
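The likelihood-ratio test reported by `anova` can be reproduced by hand - a sketch using the two fitted models: the drop in residual deviance is compared against a chi-squared distribution with degrees of freedom equal to the number of added parameters.

```r
dev_drop <- deviance(baskWL1) - deviance(baskWL2)   # 195.34 - 182.56 = 12.77
pchisq(dev_drop, df = 1, lower.tail = FALSE)        # ~0.00035, matching the table
```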
---

## Logistic Regression: three predictors

.tiny[
```r
baskWL3<-glm(Win~Home+X3PointShots.+Opp.3PointShots., data=basketball, family=binomial('logit'))
anova(baskWL1,baskWL2, baskWL3, test = "LR")
```

```
## Analysis of Deviance Table
## 
## Model 1: Win ~ Home
## Model 2: Win ~ Home + X3PointShots.
## Model 3: Win ~ Home + X3PointShots. + Opp.3PointShots.
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       148     195.34                          
## 2       147     182.56  1   12.773 0.0003516 ***
## 3       146     166.93  1   15.635 7.683e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```r
baskWL4<-glm(Win~Home*X3PointShots.+Opp.3PointShots., data=basketball, family=binomial('logit'))
anova(baskWL3, baskWL4, test = "LR")
```

```
## Analysis of Deviance Table
## 
## Model 1: Win ~ Home + X3PointShots. + Opp.3PointShots.
## Model 2: Win ~ Home * X3PointShots. + Opp.3PointShots.
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       146     166.93                     
## 2       145     166.85  1 0.078666   0.7791
```
]

---

## Visualising results

```r
basketball$Prob_mod2=predict(baskWL2, type='response')
ggplot(data=basketball, aes(x=X3PointShots., y=Prob_mod2, colour=Home))+geom_point(size=2)+geom_point(data=basketball, aes(x=X3PointShots., y=Win), size=2, col='blue')
```

<img src="GLMs_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />

---

## Visualising results

```r
basketball$Prob_mod3=predict(baskWL3, type='response')
ggplot(data=basketball, aes(x=X3PointShots., y=Prob_mod3, colour=Home))+geom_point(size=2)+geom_point(data=basketball, aes(x=X3PointShots., y=Win), size=2, col='blue')
```

<img src="GLMs_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />

---

## Confidence intervals and R2

```r
confint(baskWL3)
```

```
## Waiting for profiling to be done...
```

```
##                        2.5 %    97.5 %
## (Intercept)       -2.0130842  1.684825
## HomeHome           0.3598188  1.847523
## X3PointShots.      3.0259724 10.586459
## Opp.3PointShots. -11.2785830 -3.544063
```

```r
require(DescTools)
PseudoR2(baskWL3, which='McFadden')
```

```
##  McFadden 
## 0.1935245
```

```r
PseudoR2(baskWL3, which='Nagelkerke')
```

```
## Nagelkerke 
##  0.3131498
```

[Pseudo R2 formulas](https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/)

---

## Model check

```r
Ctabs<-table(basketball$Win,basketball$Prob_mod3>0.5)
Ctabs
```

```
##    
##     FALSE TRUE
##   0    47   22
##   1    19   62
```

Recall: `\(\frac{\text{true positives}}{\text{true positives}+\text{false negatives}}=\frac{62}{62+19}=0.77\)` <br/><br/>

Precision: `\(\frac{\text{true positives}}{\text{true positives}+\text{false positives}}=\frac{62}{62+22}=0.74\)`<br/><br/>

F1 score: `\(2*\frac{Precision * Recall}{Precision+Recall} = 0.75\)`
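The same metrics can be computed straight from `Ctabs` - a short sketch (rows are the observed classes, columns the predicted ones):

```r
TP <- Ctabs['1','TRUE']; FP <- Ctabs['0','TRUE']; FN <- Ctabs['1','FALSE']
recall    <- TP / (TP + FN)                      # 62/81 ~ 0.77
precision <- TP / (TP + FP)                      # 62/84 ~ 0.74
2 * precision * recall / (precision + recall)   # F1    ~ 0.75
```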
???
https://en.wikipedia.org/wiki/Precision_and_recall

---

## Important aspects: theory

- How we use the linear model with non-normal outcomes (link functions) <br/><br/>
- What logit values are and how we get to them <br/><br/>
- How to interpret logistic regression: logits, odds, and probabilities<br/><br/>
- What the deviance of the model and the AIC are

---

## Important aspects: practice

- Building a logistic model <br/><br/>
- Predicting values from a logistic model <br/><br/>
- Transforming logit values to odds and to probabilities<br/><br/>
- Comparing models using deviance, the anova function, or AIC<br/><br/>
- Cross-tabulating model predictions with observed data

---

## Literature

Chapter 5 of "Data Analysis Using Regression and Multilevel/Hierarchical Models" by Andrew Gelman and Jennifer Hill <br/>

Chapter 10 of "Statistical Rethinking: A Bayesian Course with Examples in R and Stan" by Richard McElreath + check out his YouTube channel

Jaeger, F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59, 434-446.

---

# Thank you for your attention