class: center, inverse
background-image: url("main2.png")
background-size: contain

---

<style type="text/css">
body, td {
  font-size: 15px;
}
code.r{
  font-size: 15px;
}
pre {
  font-size: 20px
}
.huge .remark-code { /*Change made here*/
  font-size: 200% !important;
}
.tiny .remark-code { /*Change made here*/
  font-size: 80% !important;
}
</style>

## Press record

---

## Corrections from the previous lecture

1. Intercept in the model with Age <br/> <br/>

2. Anova comparison

---

## First meeting exercises (Friday)

Did you manage to do any of those?

---

## R code for this lecture

[Link](https://nvaci.github.io/Lecture_2_code/Lecture2_Rcode)

---

## What we know

.pull-left[
Gaussian distribution: `\(y_{i} \sim \mathcal{N}(\mu_{i},\sigma)\)` <br/> <br/>
`\(\mu_{i} = \alpha+\beta*x_i\)` <br/><br/>
Model the mean and the standard deviation <br/> <br/> <br/> <br/>

a) Errors: `\(\mathcal{N}^{iid}(0,\sigma^2)\)` <br/> <br/>
b) Linearity and additivity <br/> <br/>
c) Validity <br/> <br/>
d) Lack of perfect multicollinearity <br/> <br/>
]

.pull-right[
<img src="gauss.png" width="120%"> <br/>
Google Doodle: Johann Carl Friedrich Gauß’s 241st Birthday
]

---

## What are these data?

.center[
<img src="image.png" width="50%">
]

---

## Probability distributions

The exponential family of distributions: <br/>

.pull-left[
Continuous outcomes: <br/><br/>
Categorical outcomes (Yes/No or multiple categories): <br/><br/><br/>
Counted outcomes (successes, counts of multiple outcomes, occurrences in continuous domains): <br/><br/> <br/> <br/> <br/>
Positive real outcomes (always positive):
]

.pull-right[
Normal (Gaussian) distribution<br/><br/>
Bernoulli distribution, Categorical distribution<br/><br/> <br/>
Binomial distribution, Multinomial distribution, Poisson distribution <br/><br/><br/><br/>
Exponential, Gamma or Inverse Gaussian distribution <br/><br/>
]

---

## Exponential family

.center[
<img src="Exp1.png" width="90%"> <br/>
]

---

## Exponential family

.center[
<img src="Exp2.png" width="90%"> <br/>
]

---

## Exponential family

.center[
<img src="Exp3.png" width="90%"> <br/>
]

---

## Exponential family

.center[
<img src="Exp4.png" width="90%"> <br/>
]

---

## Exponential family

.center[
<img src="Exp5.png" width="90%"> <br/>
]

???
We can use all of these distributions as outcomes in generalized linear models<br/>
Poisson - counts with no upper bound <br/>
Gamma - always positive, time to an event with multiple components (e.g., age of onset of cancer) <br/><br/>
Many of these distributions converge to the Normal: the Binomial if the probability is far from the edges, the Poisson and Gamma if the mean is large - remember the central limit theorem<br/><br/>
https://en.wikipedia.org/wiki/Exponential_family<br/><br/>
https://i.stack.imgur.com/HgpO4.jpg <br/><br/>
Check out the amazing lectures by Richard McElreath (this lecture draws on his ideas extensively): [youtube channel](https://www.youtube.com/channel/UCNJK6_DZvcMqNSzQdEkzvzA)

---

## Binomial distribution

A coin is flipped 100 times. What is the probability that it lands heads up exactly 75 times?
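Before deriving the answer, a quick numerical check in R - a sketch assuming a fair coin, so `p = 0.5` (`dbinom` returns the Binomial probability mass):

```r
# P(exactly 75 heads in 100 flips of a fair coin)
dbinom(75, size = 100, prob = 0.5)   # ~1.9e-07: very unlikely
```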
<br/><br/>

The number of successes in n independent Bernoulli trials follows a Binomial distribution:<br/><br/> <br/>

`\(P(X=x)=\binom{n}{x}p^x(1-p)^{n-x}\)` <br/><br/>

`\(\mu=n*p\)` <br/><br/>

`\(\sigma^2=n*p*(1-p)\)`

---

## Babies

```
## 
##  0  1 
## 37 63
```

```r
par(mfrow=c(1,1), bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16)
plot(1:100, dbinom(1:100, size=100, prob=0.63), xlab='N of babies crawling', ylab='Probability', type='l')
```

<img src="GLMs_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto 0 auto auto;" />

---

## Babies

`\(\mu=100*0.63=63\)` <br/><br/>
`\(\sigma=\sqrt{100*0.63*(1-0.63)}=4.83\)`

Exactly 75:

```r
x=75
choose(100, x)*0.63^x*(1-0.63)^(100-x)
```

```
## [1] 0.003470009
```

At least 64 (more than the expected 63):

```r
x=64:100
sum(choose(100, x)*0.63^x*(1-0.63)^(100-x))
```

```
## [1] 0.4623385
```

---

## Logistic regression

Distribution: <br/>

`$$y_i \sim Binomial(n,p)$$`

`$$p=\alpha+\beta*x_i$$`

--

Map values to [0,1]: <br/>

`$$f(p) =log(\frac{p}{1-p})$$`

`$$logit(p)=\alpha+\beta*x_i$$`

???
Our outcome y is a count that follows a Binomial distribution.<br/>
What we would like to estimate/model is the probability (p).<br/>
What would happen if we modelled the probability using linear regression?<br/><br/>
We would run into these issues:<br/>
Errors are not necessarily normal <br/>
Mean and SD are related <br/>
A linear function is unbounded - it does not have to result in a number between 0 and 1 <br/>

---

## First step

`$$Odds =\frac{p_{i}}{1-p_{i}}$$`

<img src="GLMs_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Second step

`$$logOdds =log(\frac{p_{i}}{1-p_{i}})$$`

<img src="GLMs_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## What is the model doing?

<img src="image3.png" width="100%">

---

## What is the model doing?

<img src="image2a.png" width="100%">

---

## Relation between LogOdds and Probabilities

.center[
<img src="image4.png" width="50%">
]

---

## Generalized linear models

.center[
<img src="links.png" width="80%">
]

---

## Logistic regression in R

We return to our crying babies that are motivated to start crawling.

```r
par(mfrow=c(1,3), bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16)
plot(Babies$Age, Babies$Height, xlab='Age (months)', ylab='Height (cm)')
boxplot(Babies$Height~Babies$Gender,xlab='Gender', ylab='Height (cm)')
boxplot(Babies$Age~Babies$Crawl,xlab='Crawl', ylab='Age (months)')
```

<img src="GLMs_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

## Logistic regression in R: coefficients

Predict crawling success as a linear function of age

```r
glm1<-glm(Crawl~Age, data=Babies, family=binomial(link='logit'))
glm1$coefficients
```

```
## (Intercept)         Age 
##  -1.3307752   0.1194777
```

<br/><br/>
Three ways to interpret logistic regression:<br/><br/>
1. Logit values<br/>
2. Odds ratios<br/>
3. Probabilities<br/>

---

## Odds ratios

```r
glm1$coefficients
```

```
## (Intercept)         Age 
##  -1.3307752   0.1194777
```

```r
exp(glm1$coefficients)
```

```
## (Intercept)         Age 
##   0.2642723   1.1269081
```

Intercept: When Age is zero (at birth), the odds of starting to crawl are 0.26 - that is, 74% (1 - 0.26) lower than the odds of not crawling.

Age: For each additional month of age, the odds of starting to crawl are multiplied by 1.13 - an increase of about 13%.
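Because the model is linear on the log-odds scale, a one-month difference in Age multiplies the odds by the same factor at any age. A small sketch using the fitted `glm1` from above:

```r
# Odds of crawling at 10 and 11 months, from the fitted coefficients
odds10 <- exp(coef(glm1)[[1]] + coef(glm1)[[2]] * 10)
odds11 <- exp(coef(glm1)[[1]] + coef(glm1)[[2]] * 11)
odds11 / odds10   # equals exp(coef(glm1)[[2]]), i.e. ~1.13, at any age
```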
---

## Probabilities

`\(p=\frac{1}{1+e^{-(\alpha+\beta*x)}}\)`<br/><br/>

```r
1/(1+exp(1.33078))
```

```
## [1] 0.2090304
```

Intercept: the estimated probability of babies starting to crawl when Age is 0 (at birth)<br/><br/>

Age - we need to decide where on the curve to evaluate the change:<br/>

```r
1/(1+exp(1.33078-0.11948*10))
```

```
## [1] 0.4660573
```

```r
arm::invlogit(coef(glm1)[[1]]+coef(glm1)[[2]]*mean(Babies$Age))
```

```
## [1] 0.6562395
```

---

## Interpretation continued

```r
Babies$LogOdds=-1.33078+0.11948*Babies$Age
Babies$Odds=exp(Babies$LogOdds)
Babies$Probs=Babies$Odds/(1+Babies$Odds)
par(mfrow=c(1,3), bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16)
plot(Babies$Age,Babies$LogOdds)
plot(Babies$Age, Babies$Odds)
plot(Babies$Age,Babies$Probs)
```

<img src="GLMs_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

## Predicted probability versus outcome

```r
ggplot(data=Babies, aes(x=Age, y=Probs))+geom_point(size=2, col='blue')+geom_point(data=Babies, aes(x=Age, y=Crawl), size=2, col='red')
```

<img src="GLMs_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

## Model results

```r
summary(glm1)
```

```
## 
## Call:
## glm(formula = Crawl ~ Age, family = binomial(link = "logit"), 
##     data = Babies)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0687  -0.9876   0.5443   0.8979   1.6082  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.33078    0.50868  -2.616 0.008894 ** 
## Age          0.11948    0.03079   3.880 0.000104 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 131.79  on 99  degrees of freedom
## Residual deviance: 113.35  on 98  degrees of freedom
## AIC: 117.35
## 
## Number of Fisher Scoring iterations: 3
```

???
[Wald test](https://stats.stackexchange.com/a/60083)

---

## Logistic regression formula

What is missing in this equation?

`$$log(\frac{p}{1-p})=\alpha+\beta*x_i$$` <br/> <br/>

$$ y = \alpha + \beta * Age + \epsilon $$ <br/><br/>

--

We do not have an estimate of the error: `\(y_i \sim Binomial(n,p)\)`

We do have residuals - an estimate of the fit <br/>

`$$residual_i=y_i-logit^{-1}(\beta*x_i)$$`

---

## Deviance

Goodness of fit: deviance of the fitted model with respect to a perfect (saturated) model <br/><br/>

Null deviance: `\(D=-2*loglik(\beta_0)\)` <br/><br/>

Residual deviance: `\(D=-2*loglik(\beta*x)\)` <br/><br/>

```r
with(glm1, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
```

```
## [1] 1.751679e-05
```

AIC: `\(-2*loglik(\beta*x)+2*npar\)`<br/><br/>

BIC: `\(-2*loglik(\beta*x)+log(n)*npar\)`, where n is the number of observations
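These quantities can be checked directly in R - a small sketch using `glm1`: for a 0/1 outcome the saturated model has log-likelihood 0, so the residual deviance is simply `-2*logLik` of the fitted model.

```r
-2 * as.numeric(logLik(glm1))   # 113.35, the residual deviance
glm1$deviance                   # the same value, as reported by glm
AIC(glm1)                       # -2*logLik + 2*npar = 113.35 + 2*2 = 117.35
```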
???
Deviance: tests whether the model is more accurate than simply always guessing that the outcome will be the more common of the two categories

---

## Link functions and other GLMs

We can also use other link functions: <br/><br/>

- Probit: `\(p = \Phi(\beta*x_i)\)`
- Cauchit
- log and cloglog (complementary log-log) <br/><br/> <br/>

The most common generalized linear models:

- Poisson regression: traffic, goals in a soccer game <br/>
- Negative binomial: same problems, but deals with overdispersion <br/>
- Gamma regression: always positive, right-skewed data (reaction times?) <br/>

---

## Absolutely cool things to note

The question we are asking is: can we stratify the probability of success in the outcome using our predictor variables? <br/><br/>

We are modelling the mean and we do not have information on individual data points - there is no error term <br/><br/>

The scale of the parameters is different! <br/><br/>

Parameters are multiplicative - everything interacts, especially in the tails of the probability distribution (see the sketch below) <br/><br/>

It is not easy to compare the models - especially if they differ in more than one predictor - the deviance criterion tells you how well your model fits the mean <br/>
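A minimal sketch of that multiplicative point: the same one-unit shift on the logit scale moves the probability a lot around p = 0.5 and barely at all in the tails (the inverse logit is written out by hand here):

```r
invlogit <- function(x) 1 / (1 + exp(-x))
invlogit(0 + 1) - invlogit(0)   # ~0.23: a large change near p = 0.5
invlogit(3 + 1) - invlogit(3)   # ~0.03: a small change in the tail
```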
---

class: inverse, middle, center

# Practical aspect

---

## Data

NBA dataset: [Link](https://www.kaggle.com/ionaskel/nba-games-stats-from-2014-to-2018)

```r
basketball<-read.table('Basketball.txt',sep='\t', header=T)
knitr::kable(head(basketball[,c(5,13,18,31,34,43)]), format = 'html')
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:left;"> Home </th>
   <th style="text-align:right;"> X3PointShots </th>
   <th style="text-align:right;"> FreeThrows. </th>
   <th style="text-align:right;"> Opp.3PointShots. </th>
   <th style="text-align:right;"> Opp.FreeThrows. </th>
   <th style="text-align:right;"> Win </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:left;"> Away </td>
   <td style="text-align:right;"> 13 </td>
   <td style="text-align:right;"> 0.529 </td>
   <td style="text-align:right;"> 0.308 </td>
   <td style="text-align:right;"> 0.818 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 32 </td>
   <td style="text-align:left;"> Away </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 0.867 </td>
   <td style="text-align:right;"> 0.323 </td>
   <td style="text-align:right;"> 0.750 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 83 </td>
   <td style="text-align:left;"> Home </td>
   <td style="text-align:right;"> 8 </td>
   <td style="text-align:right;"> 0.652 </td>
   <td style="text-align:right;"> 0.368 </td>
   <td style="text-align:right;"> 0.769 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 112 </td>
   <td style="text-align:left;"> Home </td>
   <td style="text-align:right;"> 12 </td>
   <td style="text-align:right;"> 0.684 </td>
   <td style="text-align:right;"> 0.481 </td>
   <td style="text-align:right;"> 0.941 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 165 </td>
   <td style="text-align:left;"> Away </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 0.769 </td>
   <td style="text-align:right;"> 0.364 </td>
   <td style="text-align:right;"> 0.652 </td>
   <td style="text-align:right;"> 0 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 196 </td>
   <td style="text-align:left;"> Away </td>
   <td style="text-align:right;"> 7 </td>
   <td style="text-align:right;"> 0.611 </td>
   <td style="text-align:right;"> 0.500 </td>
   <td style="text-align:right;"> 0.650 </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
</tbody>
</table>

---

## Data

```r
table(basketball$Win)
```

```
## 
##  0  1 
## 69 81
```

<img src="GLMs_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />

---

## Cross tabulation

```r
knitr::kable(table(basketball$Win, basketball$Home), format = 'html')
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> Away </th>
   <th style="text-align:right;"> Home </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 0 </td>
   <td style="text-align:right;"> 43 </td>
   <td style="text-align:right;"> 26 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:right;"> 28 </td>
   <td style="text-align:right;"> 53 </td>
  </tr>
</tbody>
</table>

```r
datA=aggregate(cbind(basketball$X3PointShots., basketball$Opp.3PointShots., basketball$FreeThrows., basketball$Opp.FreeThrows.), list(basketball$Win), mean)
names(datA)=c('Win','X3POints_mean','Opp3POints_mean','FreeThrows_mean','OppFreeThrows_mean')
knitr::kable(datA, format = 'html')
```

<table>
 <thead>
  <tr>
   <th style="text-align:right;"> Win </th>
   <th style="text-align:right;"> X3POints_mean </th>
   <th style="text-align:right;"> Opp3POints_mean </th>
   <th style="text-align:right;"> FreeThrows_mean </th>
   <th style="text-align:right;"> OppFreeThrows_mean </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0 </td>
   <td style="text-align:right;"> 0.3200580 </td>
   <td style="text-align:right;"> 0.3914928 </td>
   <td style="text-align:right;"> 0.7498696 </td>
   <td style="text-align:right;"> 0.7724928 </td>
  </tr>
  <tr>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0.3840494 </td>
   <td style="text-align:right;"> 0.3252593 </td>
   <td style="text-align:right;"> 0.7752099 </td>
   <td style="text-align:right;"> 0.7541975 </td>
  </tr>
</tbody>
</table>

---

## Plot

```r
par(mfrow=c(1,2), bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16)
plot(basketball$X3PointShots.,basketball$Win)
plot(basketball$X3PointShots.,jitter(basketball$Win, 0.5))
```

<img src="GLMs_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" />

---

## Logistic Regression: one predictor

.tiny[
```r
baskWL1<-glm(Win~Home, data=basketball, family=binomial('logit'))
summary(baskWL1)
```

```
## 
## Call:
## glm(formula = Win ~ Home, family = binomial("logit"), data = basketball)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4909  -1.0015   0.8935   0.8935   1.3642  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.4290     0.2428  -1.767 0.077296 .  
## HomeHome      1.1412     0.3410   3.346 0.000819 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 206.98  on 149  degrees of freedom
## Residual deviance: 195.33  on 148  degrees of freedom
## AIC: 199.33
## 
## Number of Fisher Scoring iterations: 4
```
]

---

## Logistic Regression: two predictors

.tiny[
```r
baskWL2<-glm(Win~Home+X3PointShots., data=basketball, family=binomial('logit'))
summary(baskWL2)
```

```
## 
## Call:
## glm(formula = Win ~ Home + X3PointShots., family = binomial("logit"), 
##     data = basketball)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9950  -1.0185   0.5857   0.9993   1.8696  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -2.5847     0.6977  -3.705 0.000212 ***
## HomeHome        1.1151     0.3568   3.125 0.001779 ** 
## X3PointShots.   6.1575     1.8229   3.378 0.000731 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 206.98  on 149  degrees of freedom
## Residual deviance: 182.56  on 147  degrees of freedom
## AIC: 188.56
## 
## Number of Fisher Scoring iterations: 4
```
]

---

## Model comparison

```r
anova(baskWL1, baskWL2, test = "LR")
```

```
## Analysis of Deviance Table
## 
## Model 1: Win ~ Home
## Model 2: Win ~ Home + X3PointShots.
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       148     195.34                          
## 2       147     182.56  1   12.773 0.0003516 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
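The likelihood-ratio test reported by `anova` can be reproduced by hand - a sketch using the two fitted models: the drop in residual deviance is compared against a chi-squared distribution with degrees of freedom equal to the number of added parameters.

```r
dev_drop <- deviance(baskWL1) - deviance(baskWL2)   # 195.34 - 182.56 = 12.77
pchisq(dev_drop, df = 1, lower.tail = FALSE)        # ~0.00035, matching the table
```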
---

## Logistic Regression: three predictors

.tiny[
```r
baskWL3<-glm(Win~Home+X3PointShots.+Opp.3PointShots., data=basketball, family=binomial('logit'))
anova(baskWL1,baskWL2, baskWL3, test = "LR")
```

```
## Analysis of Deviance Table
## 
## Model 1: Win ~ Home
## Model 2: Win ~ Home + X3PointShots.
## Model 3: Win ~ Home + X3PointShots. + Opp.3PointShots.
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       148     195.34                          
## 2       147     182.56  1   12.773 0.0003516 ***
## 3       146     166.93  1   15.635 7.683e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```r
baskWL4<-glm(Win~Home*X3PointShots.+Opp.3PointShots., data=basketball, family=binomial('logit'))
anova(baskWL3, baskWL4, test = "LR")
```

```
## Analysis of Deviance Table
## 
## Model 1: Win ~ Home + X3PointShots. + Opp.3PointShots.
## Model 2: Win ~ Home * X3PointShots. + Opp.3PointShots.
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       146     166.93                     
## 2       145     166.85  1 0.078666   0.7791
```
]

---

## Visualising results

```r
basketball$Prob_mod2=predict(baskWL2, type='response')
ggplot(data=basketball, aes(x=X3PointShots., y=Prob_mod2, colour=Home))+geom_point(size=2)+geom_point(data=basketball, aes(x=X3PointShots., y=Win), size=2, col='blue')
```

<img src="GLMs_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />

---

## Visualising results

```r
basketball$Prob_mod3=predict(baskWL3, type='response')
ggplot(data=basketball, aes(x=X3PointShots., y=Prob_mod3, colour=Home))+geom_point(size=2)+geom_point(data=basketball, aes(x=X3PointShots., y=Win), size=2, col='blue')
```

<img src="GLMs_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />

---

## Confidence intervals and R2

```r
confint(baskWL3)
```

```
## Waiting for profiling to be done...
```

```
##                        2.5 %    97.5 %
## (Intercept)       -2.0130842  1.684825
## HomeHome           0.3598188  1.847523
## X3PointShots.      3.0259724 10.586459
## Opp.3PointShots. -11.2785830 -3.544063
```

```r
require(DescTools)
PseudoR2(baskWL3, which='McFadden')
```

```
##  McFadden 
## 0.1935245
```

```r
PseudoR2(baskWL3, which='Nagelkerke')
```

```
## Nagelkerke 
##  0.3131498
```

[Pseudo R2 formulas](https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/)

---

## Model check

```r
Ctabs<-table(basketball$Win,basketball$Prob_mod3>0.5)
Ctabs
```

```
##    
##     FALSE TRUE
##   0    47   22
##   1    19   62
```

Recall: `\(\frac{\text{true positives}}{\text{true positives}+\text{false negatives}}=\frac{62}{62+19}=0.77\)` <br/><br/>

Precision: `\(\frac{\text{true positives}}{\text{true positives}+\text{false positives}}=\frac{62}{62+22}=0.74\)`<br/><br/>

F1 score: `\(2*\frac{Precision * Recall}{Precision+Recall} = 0.75\)`
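The same metrics can be computed straight from `Ctabs` - a short sketch (rows are the observed classes, columns the predicted ones):

```r
TP <- Ctabs['1','TRUE']; FP <- Ctabs['0','TRUE']; FN <- Ctabs['1','FALSE']
recall    <- TP / (TP + FN)                      # 62/81 ~ 0.77
precision <- TP / (TP + FP)                      # 62/84 ~ 0.74
2 * precision * recall / (precision + recall)   # F1    ~ 0.75
```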
???
https://en.wikipedia.org/wiki/Precision_and_recall

---

## Important aspects: theory

- How we use the linear model with non-normal outcomes (link functions) <br/><br/>
- What logit values are and how we get to them <br/><br/>
- How to interpret logistic regression: logits, odds, and probabilities<br/><br/>
- What the deviance of the model and the AIC are

---

## Important aspects: practice

- Building a logistic model <br/><br/>
- Predicting values from a logistic model <br/><br/>
- Transforming logit values to odds and to probabilities<br/><br/>
- Comparing models using deviance, the anova function, or AIC<br/><br/>
- Cross-tabulating model predictions with observed data

---

## Literature

Chapter 5 of "Data Analysis Using Regression and Multilevel/Hierarchical Models" by Andrew Gelman and Jennifer Hill <br/>

Chapter 10 of "Statistical Rethinking: A Bayesian Course with Examples in R and Stan" by Richard McElreath + check out his YouTube channel

Jaeger, F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59, 434-446.

---

# Thank you for your attention