Data simulation
We can also simulate categorical outcomes.
set.seed(456)
Babies = data.frame(Age = round(runif(100, 1, 30)), Weight = rnorm(100, 4000, 500))
Babies$Height = rnorm(100, 40 + 0.2 * Babies$Age + 0.004 * Babies$Weight, 5)
Babies$Gender = rbinom(100, 1, 0.5)
Babies$Crawl = rbinom(100, 1, 0.031 * Babies$Age + 1e-05 * Babies$Weight - 0.06 *
Babies$Gender) #Crawl is simulated from a binomial distribution: 100 draws of a single Bernoulli trial each, where the probability of success depends on the babies' Age, Weight, and Gender
Babies$Gender = as.factor(Babies$Gender) # I recode these numbers to factor
levels(Babies$Gender) = c("Girls", "Boys") # Assigning labels to Gender factor
table(Babies$Crawl)
##
## 0 1
## 37 63
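As a quick check, the observed proportion of crawlers can be computed directly; it is the 0.63 used as the success probability in the plot below:
mean(Babies$Crawl) #observed proportion of crawlers: 63/100 = 0.63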
Plotting the probability mass function for the number of babies crawling in our data:
par(mfrow = c(1, 1), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(1:100, dbinom(1:100, size = 100, prob = 0.63), xlab = "N of babies crawling", ylab = "Probability",
    type = "l")
Relation between log odds, odds, and probabilities
require(ggplot2) #load in ggplot2
logit <- data.frame(LogOdds = seq(-2.5, 2.5, by = 0.1), Pred = seq(-2.5, 2.5, by = 0.1)) #create a data frame where the variable LogOdds contains numbers from -2.5 to 2.5 in steps of 0.1
logit$Odds = exp(logit$LogOdds) #exponentiate logits that results in odds
logit$Probabilities = logit$Odds/(1 + logit$Odds) #transform odds to probabilities
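As a side note, base R implements these same transformations: plogis() is the inverse-logit (log-odds to probabilities) and qlogis() is the logit (probabilities to log-odds). A minimal check against our manual calculation:
all.equal(plogis(logit$LogOdds), logit$Probabilities) #TRUE - plogis reproduces our two-step transformation
all.equal(qlogis(logit$Probabilities), logit$LogOdds) #TRUE - qlogis maps the probabilities back to log-odds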
ggplot(data = logit, aes(x = Pred, y = Odds)) + geom_point(size = 2) + theme_bw() +
ylim(0, 13) + theme(axis.title = element_text(size = 14), axis.text = element_text(size = 12)) #plotting odds
Plotting Log-Odds:
ggplot(data = logit, aes(x = Pred, y = LogOdds)) + geom_point(size = 2) + theme_bw() +
ylim(-4, 4) + theme(axis.title = element_text(size = 14), axis.text = element_text(size = 12))
Plotting probabilities:
ggplot(data = logit, aes(x = Pred, y = Probabilities)) + geom_point(size = 2) + theme_bw() +
ylim(0, 1) + theme(axis.title = element_text(size = 14), axis.text = element_text(size = 12))
We can again check what is in our simulated data:
par(mfrow = c(1, 3), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(Babies$Age, Babies$Height, xlab = "Age (months)", ylab = "Height (cm)")
boxplot(Babies$Height ~ Babies$Gender, xlab = "Gender", ylab = "Height (cm)")
boxplot(Babies$Age ~ Babies$Crawl, xlab = "Crawl", ylab = "Age (months)") #boxplot of babies' Age across the two Crawl outcomes; xlab and ylab set the axis labels
Logistic regression model
Let’s build our first logistic regression model:
glm1 <- glm(Crawl ~ Age, data = Babies, family = binomial(link = "logit")) #the glm function fits generalized linear models (try typing ?glm in your console). Crawl is modelled as a function of Age using the Babies data. The family argument specifies the distribution of the outcome: our outcome follows a binomial distribution and we use the logit link - the log-odds transformation
glm1$coefficients #we are printing only coefficients
## (Intercept) Age
## -1.3307752 0.1194777
Let’s get the odds:
exp(glm1$coefficients) #the exponential function transforms our coefficients into odds ratios - how much more likely babies are to start crawling if they are older by 1 month (for the slope coefficient)
## (Intercept) Age
## 0.2642723 1.1269081
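To see that the odds ratio of about 1.13 really is the multiplicative change in odds per month, we can compare the model-implied odds at two adjacent ages (10 and 11 months are arbitrary choices):
odds10 <- exp(coef(glm1)[[1]] + coef(glm1)[[2]] * 10) #odds of crawling at 10 months
odds11 <- exp(coef(glm1)[[1]] + coef(glm1)[[2]] * 11) #odds of crawling at 11 months
odds11/odds10 #~1.127, identical to exp(coef(glm1)[[2]])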
Let’s get the probabilities:
1/(1 + exp(1.33078)) #intercept only: predicted probability of crawling when Age = 0
## [1] 0.2090304
1/(1 + exp(1.33078 - 0.11948 * 10)) #What is the probability of babies starting to crawl when they are 10 months old?
## [1] 0.4660573
arm::invlogit(coef(glm1)[[1]] + coef(glm1)[[2]] * mean(Babies$Age)) #probability of starting to crawl at the mean Age (around 16 months). The invlogit function from the arm package converts log-odds to probabilities automatically; coef(glm1)[[1]] gives the intercept and coef(glm1)[[2]] the slope for Age
## [1] 0.6562395
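The same probability can be obtained with predict(), which is less error-prone than typing the coefficients in by hand:
predict(glm1, newdata = data.frame(Age = mean(Babies$Age)), type = "response") #~0.656, matching the manual calculation above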
We can plot the inner workings of our model: log-odds, odds, and probabilities.
Babies$LogOdds = -1.33078 + 0.11948 * Babies$Age #computing log-odds (logit) values from our model coefficients
Babies$Odds = exp(Babies$LogOdds) #Transforming them to odds
Babies$Probs = Babies$Odds/(1 + Babies$Odds) # Transforming odds to probabilities
par(mfrow = c(1, 3), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(Babies$Age, Babies$LogOdds)
plot(Babies$Age, Babies$Odds)
plot(Babies$Age, Babies$Probs)
Let’s see what goes into the model and how our model sees the data:
ggplot(data = Babies, aes(x = Age, y = Probs)) + geom_point(size = 2, col = "blue") +
geom_point(data = Babies, aes(x = Age, y = Crawl), size = 2, col = "red") #Red points show our dependent outcome - 0,1; blue points show estimated probabilities across the values of Age.
Summary of the results
summary(glm1) #Summary of the complete model
##
## Call:
## glm(formula = Crawl ~ Age, family = binomial(link = "logit"),
## data = Babies)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0687 -0.9876 0.5443 0.8979 1.6082
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.33078 0.50868 -2.616 0.008894 **
## Age 0.11948 0.03079 3.880 0.000104 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 131.79 on 99 degrees of freedom
## Residual deviance: 113.35 on 98 degrees of freedom
## AIC: 117.35
##
## Number of Fisher Scoring iterations: 3
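One commonly reported effect-size summary for logistic regression, not included in the summary() output, is McFadden’s pseudo-R-squared; a minimal sketch using the deviances stored in the model object:
1 - glm1$deviance/glm1$null.deviance #McFadden's pseudo-R2: ~0.14 for this model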
We can test whether our proposed model is a statistically significant improvement over the null (intercept-only) model:
with(glm1, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE)) #using values stored in the glm1 object, compute a p-value from the chi-square distribution, where the test statistic is the difference between the null deviance and the deviance of our fitted model, on the corresponding difference in degrees of freedom; lower.tail = FALSE gives the right tail of the distribution
## [1] 1.751679e-05
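The same likelihood-ratio test can be obtained more conveniently by letting anova() compare the fitted model against its null:
anova(glm1, test = "Chisq") #sequentially adds Age to the intercept-only model and reports the same chi-square test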
Practical aspect:
basketball <- read.table("Basketball.txt", sep = "\t", header = T) #load the data into R. The file is tab-delimited (sep = "\t") and the variable names are in the first row (header = T)
knitr::kable(head(basketball[, c(5, 13, 18, 31, 34, 43)]), format = "html") #Printing only certain columns (numbers in c())
|     | Home | X3PointShots | FreeThrows. | Opp.3PointShots. | Opp.FreeThrows. | Win |
|-----|------|--------------|-------------|------------------|-----------------|-----|
| 1   | Away | 13           | 0.529       | 0.308            | 0.818           | 0   |
| 32  | Away | 6            | 0.867       | 0.323            | 0.750           | 1   |
| 83  | Home | 8            | 0.652       | 0.368            | 0.769           | 1   |
| 112 | Home | 12           | 0.684       | 0.481            | 0.941           | 0   |
| 165 | Away | 7            | 0.769       | 0.364            | 0.652           | 0   |
| 196 | Away | 7            | 0.611       | 0.500            | 0.650           | 1   |
Let’s plot the data to see what it looks like:
table(basketball$Win) #Tabulate outcome
##
## 0 1
## 69 81
par(mfrow = c(1, 2), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(density(basketball$X3PointShots.), main = "", xlab = "3PointShots")
plot(density(basketball$Opp.3PointShots.), main = "", xlab = "Opp3PointShots")
We can also cross-tabulate the data, focusing on combinations of categories:
knitr::kable(table(basketball$Win, basketball$Home), format = "html")
|   | Away | Home |
|---|------|------|
| 0 | 43   | 26   |
| 1 | 28   | 53   |
datA = aggregate(cbind(basketball$X3PointShots., basketball$Opp.3PointShots., basketball$FreeThrows.,
    basketball$Opp.FreeThrows.), list(basketball$Win), mean) #the aggregate function summarises our data: here it calculates the arithmetic mean of X3PointShots, Opp.3PointShots, FreeThrows, and Opp.FreeThrows for each outcome value (0, 1). cbind joins the numerical variables together so that we can aggregate them jointly rather than one at a time. Instead of mean you can use sd, min, max, or any other summary function
names(datA) = c("Win", "X3POints_mean", "Opp3POints_mean", "FreeThrows_mean", "OppFreeThrows_mean") #aggregate returns a new data frame without informative variable names, so we quickly attach names to the columns
knitr::kable(datA, format = "html") #writing it out
| Win | X3POints_mean | Opp3POints_mean | FreeThrows_mean | OppFreeThrows_mean |
|-----|---------------|-----------------|-----------------|--------------------|
| 0   | 0.3200580     | 0.3914928       | 0.7498696       | 0.7724928          |
| 1   | 0.3840494     | 0.3252593       | 0.7752099       | 0.7541975          |
Plotting predictors and categorical outcome
par(mfrow = c(1, 2), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(basketball$X3PointShots., basketball$Win)
plot(basketball$X3PointShots., jitter(basketball$Win, 0.5)) #jitter adds a small amount of random noise to the 0/1 outcome so that overlapping points become visible
Let’s fit our first model (one predictor):
baskWL1 <- glm(Win ~ Home, data = basketball, family = binomial("logit")) #Main effect of Home
summary(baskWL1)
##
## Call:
## glm(formula = Win ~ Home, family = binomial("logit"), data = basketball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4909 -1.0015 0.8935 0.8935 1.3642
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4290 0.2428 -1.767 0.077296 .
## HomeHome 1.1412 0.3410 3.346 0.000819 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 206.98 on 149 degrees of freedom
## Residual deviance: 195.33 on 148 degrees of freedom
## AIC: 199.33
##
## Number of Fisher Scoring iterations: 4
Model with two predictors:
baskWL2 <- glm(Win ~ Home + X3PointShots., data = basketball, family = binomial("logit")) #Main effects of Home and X3PointShots.
summary(baskWL2)
##
## Call:
## glm(formula = Win ~ Home + X3PointShots., family = binomial("logit"),
## data = basketball)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9950 -1.0185 0.5857 0.9993 1.8696
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5847 0.6977 -3.705 0.000212 ***
## HomeHome 1.1151 0.3568 3.125 0.001779 **
## X3PointShots. 6.1575 1.8229 3.378 0.000731 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 206.98 on 149 degrees of freedom
## Residual deviance: 182.56 on 147 degrees of freedom
## AIC: 188.56
##
## Number of Fisher Scoring iterations: 4
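As before, exponentiating the coefficients puts them on the odds scale:
exp(coef(baskWL2)) #playing at home multiplies the odds of winning by exp(1.1151) ~ 3.05, holding 3-point shooting constant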
Model comparison (model 2 versus model 1):
anova(baskWL1, baskWL2, test = "LR") #compare the two models using a likelihood-ratio test, appropriate for generalized linear models
## Analysis of Deviance Table
##
## Model 1: Win ~ Home
## Model 2: Win ~ Home + X3PointShots.
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 148 195.34
## 2 147 182.56 1 12.773 0.0003516 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Three predictors:
baskWL3 <- glm(Win ~ Home + X3PointShots. + Opp.3PointShots., data = basketball,
    family = binomial("logit")) #Main effects of Home, X3PointShots. and Opp.3PointShots.
anova(baskWL1, baskWL2, baskWL3, test = "LR") #comparing three models
## Analysis of Deviance Table
##
## Model 1: Win ~ Home
## Model 2: Win ~ Home + X3PointShots.
## Model 3: Win ~ Home + X3PointShots. + Opp.3PointShots.
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 148 195.34
## 2 147 182.56 1 12.773 0.0003516 ***
## 3 146 166.93 1 15.635 7.683e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interactions:
baskWL4 <- glm(Win ~ Home * X3PointShots. + Opp.3PointShots., data = basketball,
    family = binomial("logit")) #Three main effects plus the interaction between Home and X3PointShots.
anova(baskWL3, baskWL4, test = "LR") #comparing the third and fourth models
## Analysis of Deviance Table
##
## Model 1: Win ~ Home + X3PointShots. + Opp.3PointShots.
## Model 2: Win ~ Home * X3PointShots. + Opp.3PointShots.
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 146 166.93
## 2 145 166.85 1 0.078666 0.7791
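Since the interaction does not improve the fit, we can also line up all four models by AIC (lower is better); the AIC() function accepts several fitted models at once:
AIC(baskWL1, baskWL2, baskWL3, baskWL4) #returns a small table of df and AIC values; baskWL3 comes out lowest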
Visualising our model with two predictors:
basketball$Prob_mod2 = predict(baskWL2, type = "response")
ggplot(data = basketball, aes(x = X3PointShots., y = Prob_mod2, colour = Home)) +
geom_point(size = 2) + geom_point(data = basketball, aes(x = X3PointShots., y = Win),
size = 2, col = "blue")
Visualising our model with three predictors:
basketball$Prob_mod3 = predict(baskWL3, type = "response")
ggplot(data = basketball, aes(x = X3PointShots., y = Prob_mod3, colour = Home)) +
geom_point(size = 2) + geom_point(data = basketball, aes(x = X3PointShots., y = Win),
size = 2, col = "blue")
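The fitted probabilities above are plotted only at the observed games, which can look jagged. A smoother alternative (a sketch, assuming ggplot2 is still loaded; grid is just an illustrative name) is to predict over an evenly spaced grid of X3PointShots. values:
grid <- expand.grid(Home = c("Away", "Home"), X3PointShots. = seq(min(basketball$X3PointShots.),
    max(basketball$X3PointShots.), length.out = 100)) #all combinations of Home and a fine grid of 3-point percentages
grid$Prob <- predict(baskWL2, newdata = grid, type = "response") #predicted probabilities from the two-predictor model
ggplot(data = grid, aes(x = X3PointShots., y = Prob, colour = Home)) + geom_line(size = 1) +
    theme_bw() #smooth predicted probability curves for away and home games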
It is good practice to report confidence intervals in your work:
confint(baskWL3) #confidence intervals for logit estimates in our third model
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -2.0130842 1.684825
## HomeHome 0.3598188 1.847523
## X3PointShots. 3.0259724 10.586459
## Opp.3PointShots. -11.2785830 -3.544063
exp(confint(baskWL3)) #confidence intervals for odds ratios in our third model
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 1.335761e-01 5.391508e+00
## HomeHome 1.433070e+00 6.344088e+00
## X3PointShots. 2.061404e+01 3.959502e+04
## Opp.3PointShots. 1.264077e-05 2.889568e-02
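A compact way to report the odds ratios together with their intervals is to bind the coefficients and their confidence limits into one table before exponentiating:
exp(cbind(OR = coef(baskWL3), confint(baskWL3))) #odds ratios with 95% profile-likelihood confidence intervals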
Let’s see how accurate our model is:
Ctabs <- table(basketball$Win, basketball$Prob_mod3 > 0.5) #cross-tabulate observed outcomes against predicted probabilities cut at 0.5
Ctabs
##
## FALSE TRUE
## 0 47 22
## 1 19 62
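From this classification table, the overall accuracy is the proportion of games on the diagonal, i.e. those classified correctly at the 0.5 cut-off:
sum(diag(Ctabs))/sum(Ctabs) #(47 + 62)/150 = 0.7266667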