This document outlines the R code that was used across the lectures for Advanced Statistics course at MSc in Psychological Research Methods programme at the University of Sheffield, UK. The code follows the lectures, but also expands on the content covered by the lecture.
Lecture 1: Linear regression
Introduction
First, we used the dataset that are included in R packages. We can see what data do we have in standard R packages by calling: data() or we can see datasets specific for certain R packages by calling: data(package=‘lme4’).
In this case, we can call pre-loaded dataset cars. More information on this data is available by calling ?cars or help(cars).
data("cars") #calls the dataset
head(cars) # prints first 6 observations in the dataset
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
We can use scatter plot to see the raw values.
plot(cars$speed, cars$dist, xlab = "Predictor", ylab = "Outcome") #As this plot was only used to make an illustration of the linear regression, we changed the labels of x and y-axis. This was done using xlab and ylab parameters. You can change the name of the labels using the same code, eg. xlab='Speed (mph)', ylab='Stopping distance (ft)'
abline(lm(dist ~ speed, data = cars), lwd = 2) #abline function adds one or more straight lines through the plot that you have active in your environment. lm function is used to fit linear models, where dist is modelled as a function of speed. We also indicate that values for distance and speed can be found in the cars dataset (data = cars). Finally, we specify the thickness of abline function with lwd=2 parameter.
Data simulation
We can also simulate some data:
set.seed(456) # we can set starting numbers that are used to generate our random values. Random numbers are rarely truly random: https://engineering.mit.edu/engage/ask-an-engineer/can-a-computer-generate-a-truly-random-number/
Babies = data.frame(Age = round(runif(100, 1, 30)), Weight = rnorm(100, 4000, 500)) #We create a new data frame (Babies) where we have Age and Weight as variables. 100 Age values are sampled from random uniform distribution (runif) with lower bound of 1 (minimum) and upper bound of 30 (maximum). 100 Weight values are generated using random normal distribution (rnorm) with mean of 4000 and SD of 500
Babies$Height = rnorm(100, 40 + 0.2 * Babies$Age + 0.004 * Babies$Weight, 5) # Height is generated using random normal distribution where mean is a function of Age and Weight, while SD is 5.
Babies$Gender = factor(rbinom(100, 1, 0.5)) # 100 Gender values are generated using random binomial function with equal probability of being assigned one or the other sex category
levels(Babies$Gender) = c("Girls", "Boys") #We levels function to assign Girls and Boys labels to 1 and 0 levels generated by the function
We can plot and inspect raw data:
par(mfrow = c(1, 3), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16) # par parameter sets global plotting settings. mfrow indicates number of panels for plots (1 row and 3 columns), bty sets type of box around the plot, mar defines margines around the plot, cex magnifies size of labels, while pch sets type of points used in the plot.
plot(Babies$Age, Babies$Height, xlab = "Age (months)", ylab = "Height (cm)")
plot(Babies$Weight, Babies$Height, xlab = "Weight (grams)", ylab = "Height (cm)")
boxplot(Babies$Height ~ Babies$Gender, xlab = "Gender", ylab = "Height (cm)")
We can also check the coding and formatting of the variables in our data frame.
str(Babies)
## 'data.frame': 100 obs. of 4 variables:
## $ Age : num 4 7 22 26 24 11 3 9 8 12 ...
## $ Weight: num 3668 3872 4339 4448 4309 ...
## $ Height: num 61.5 56.1 59.3 55.2 60.3 ...
## $ Gender: Factor w/ 2 levels "Girls","Boys": 1 1 2 2 2 2 1 1 1 1 ...
Finally, we can also load the dataset from our local disc drives. The datasets should be placed in your working environment. There are a number of tutorials that guide you how to optimise creation of the new projects, which results in all of your data, code and files being generated and saved at one place. https://r4ds.had.co.nz/workflow-projects.html
inequality <- read.table("inequality.txt", sep = "\t", header = T) # read a file in table format and create a data frame from it. sep parameter defines separator in the data - tab delimited format in this case, while it could be also anything else, such as ',' - comma separated values, ' ' - space separated values etc. Header = T defines that the names of the variables are in the first row of the database.
# You can also numerous other extensions, suchs spss sav files, however you
# will often need additional packages for this. Foreign package -
# library(foreign) has read.spss function that can be used to read in spss
# files dataExample<-read.spss('Example.sav', to.data.frame=T)
Linear regression
One predictor
lm1 <- lm(Height ~ Age, data = Babies) # linear model where Height is modelled as a function of Age.
lm1$coefficients # We can print only the coefficients
## (Intercept) Age
## 57.025804 0.143174
summary(lm1$coefficients) #We can also part of the information that is frequently used to judge significance and importance of predictors
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1432 14.3638 28.5845 28.5845 42.8051 57.0258
Two predictors
lm2 <- lm(Height ~ Age + Gender, data = Babies) # linear model where Height is analysed as a function of Age and Gender
lm2$coefficients
## (Intercept) Age GenderBoys
## 57.5038799 0.1403955 -0.8309449
Main effects and interaction (numerical x categorical)
lm3 <- lm(Height ~ Age * Gender, data = Babies) # linear model where Height is modelled as a function of Age and Gender, as well as their interaction
lm3$coefficients
## (Intercept) Age GenderBoys Age:GenderBoys
## 56.1448603 0.2202400 1.8315868 -0.1607307
Main effects and interaction (numerical x numerical)
lm4 <- lm(Height ~ Age * Weight, data = Babies)
lm4$coefficients
## (Intercept) Age Weight Age:Weight
## 30.7244904854 0.6913148965 0.0066360329 -0.0001397745
Other information that we get from the linear model
lm1 <- lm(Height ~ Age, data = Babies)
summary(lm1)
##
## Call:
## lm(formula = Height ~ Age, data = Babies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.4765 -4.1601 -0.3703 3.9198 12.3842
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.02580 1.18751 48.021 <2e-16 ***
## Age 0.14317 0.06426 2.228 0.0282 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.283 on 98 degrees of freedom
## Multiple R-squared: 0.04821, Adjusted R-squared: 0.0385
## F-statistic: 4.964 on 1 and 98 DF, p-value: 0.02817
We can calculate R2 step-by-step. First, we need predictions from the model that includes predictor, as this will give us residual sum of squares.
Babies$lm1 = predict(lm1, newdata = Babies) # predict Height based on our lm1 model.
Babies$diff = Babies$lm1 - Babies$Height #calculate differences between predicted (lm1) and observed values (Height)
We can plot these differences using ggplot.
require(ggplot2)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.1.1
ggplot() + geom_linerange(data = Babies, aes(x = Age, ymin = Height, ymax = lm1,
colour = diff), size = 1.2) + geom_line(data = Babies, aes(x = Age, y = lm1),
size = 1.2) + geom_point(data = Babies, aes(Age, y = Height), size = 2) + ylab("Height") +
xlab("Age") + ggtitle("SS_residual")
A second ingredient for our determination coefficient is sum of squares of total variation in the data. This we can get by building model with intercept only.
lm0 <- lm(Height ~ 1, data = Babies)
summary(lm0)
##
## Call:
## lm(formula = Height ~ 1, data = Babies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.4165 -4.2284 -0.2062 3.6744 13.5940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.3953 0.5388 110.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.388 on 99 degrees of freedom
Babies$lm0 = predict(lm0, newdata = Babies) #predict Height based on average value only
Babies$diff2 = Babies$lm0 - Babies$Height #calculate difference between predicted (lm0) and observed values (Height)
We can plot these differences using ggplot.
ggplot() + geom_linerange(data = Babies, aes(x = Age, ymin = Height, ymax = lm0,
colour = diff2), size = 1.2) + geom_line(data = Babies, aes(x = Age, y = lm0),
size = 1.2) + geom_point(data = Babies, aes(Age, y = Height), size = 2) + ylab("Height") +
xlab("Age") + ggtitle("SS_total")
The R2 coefficient is:
1 - (sum(Babies$diff^2)/(sum(Babies$diff2^2)))
## [1] 0.04821098
Improvement in our prediction:
Babies$diff3 = Babies$lm1 - Babies$lm0 #Improvement based on the inclusion of Age as a predictor - differences between the predicted values from (lm1) and intercept only model (lm0)
ggplot() + geom_linerange(data = Babies, aes(x = Age, ymin = lm1, ymax = lm0, colour = diff3),
size = 1.2) + geom_line(data = Babies, aes(x = Age, y = lm0), size = 1.2) + geom_line(data = Babies,
aes(Age, y = lm1), size = 1.3, linetype = "dotdash") + geom_point(data = Babies,
aes(x = Age, y = Height), size = 0.9) + ylab("Height") + xlab("Age") + ggtitle("Improvement") +
theme(axis.title = element_text(size = 14), axis.text = element_text(size = 12))
F-value:
(sum(Babies$diff3^2)/1)/(sum(Babies$diff^2)/98)
## [1] 4.963995
Practical aspect of the lecture
inequality <- read.table("inequality.txt", sep = ",", header = T) #Reading the data in R
head(inequality)
## state median_household_income share_unemployed_seasonal
## 1 Alabama 42278 0.060
## 2 Alaska 67629 0.064
## 3 Arizona 49254 0.063
## 4 Arkansas 44922 0.052
## 5 California 60487 0.059
## 6 Colorado 60940 0.040
## share_population_in_metro_areas share_population_with_high_school_degree
## 1 0.64 0.821
## 2 0.63 0.914
## 3 0.90 0.842
## 4 0.69 0.824
## 5 0.97 0.806
## 6 0.80 0.893
## share_non_citizen share_white_poverty gini_index share_non_white
## 1 0.02 0.12 0.472 0.35
## 2 0.04 0.06 0.422 0.42
## 3 0.10 0.09 0.455 0.49
## 4 0.04 0.12 0.458 0.26
## 5 0.13 0.09 0.471 0.61
## 6 0.06 0.07 0.457 0.31
## share_voters_voted_trump hate_crimes_per_100k_splc
## 1 0.63 0.12583893
## 2 0.53 0.14374012
## 3 0.50 0.22531995
## 4 0.60 0.06906077
## 5 0.33 0.25580536
## 6 0.44 0.39052330
## avg_hatecrimes_per_100k_fbi
## 1 1.8064105
## 2 1.6567001
## 3 3.4139280
## 4 0.8692089
## 5 2.3979859
## 6 2.8046888
Probability density plots: outcomes
par(mfrow = c(1, 2))
plot(density(inequality$hate_crimes_per_100k_splc, na.rm = T), main = "Crimes per 100k") #probability density for hate crimes outcomes. na.rm=T indicates that only cases that are not NAs should be returned
plot(density(inequality$avg_hatecrimes_per_100k_fbi, na.rm = T), main = "Average Crimes")
Probability density plots: predictors
par(mfrow = c(1, 2))
plot(density(inequality$median_household_income, na.rm = T), main = "Income")
plot(density(inequality$gini_index, na.rm = T), main = "Gini")
Scatter plots:
par(mfrow = c(1, 2))
plot(inequality$median_household_income, inequality$avg_hatecrimes_per_100k_fbi,
xlab = "Median household income", ylab = "Avg hatecrimes")
plot(inequality$gini_index, inequality$avg_hatecrimes_per_100k_fbi, xlab = "Gini index",
ylab = "Avg hatecrimes")
cor(inequality[, c(2, 8, 12)], use = "complete.obs") #calculate correlations where we use only rows that have all values (no NAs). We are only focusing on columns: 2,8 and 12. inequality is a data frame with (n rows, m columns), so we can access only first row by calling inequality[1,], or just first column by inequality[,1], first row and first column would be inequality[1,1]. When using inequality[,c(2,8,12)] am asking for all rows but only second, eight and twelfth column.
## median_household_income gini_index
## median_household_income 1.0000000 -0.1497444
## gini_index -0.1497444 1.0000000
## avg_hatecrimes_per_100k_fbi 0.3182464 0.4212719
## avg_hatecrimes_per_100k_fbi
## median_household_income 0.3182464
## gini_index 0.4212719
## avg_hatecrimes_per_100k_fbi 1.0000000
Stepwise approach in building linear model
mod1 <- lm(avg_hatecrimes_per_100k_fbi ~ median_household_income, data = inequality) #modelling avg hatecrimes as a function of median_household_income
summary(mod1)
##
## Call:
## lm(formula = avg_hatecrimes_per_100k_fbi ~ median_household_income,
## data = inequality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3300 -1.1183 -0.1656 0.7827 7.7762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.564e-01 1.448e+00 -0.661 0.5121
## median_household_income 6.054e-05 2.603e-05 2.326 0.0243 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.642 on 48 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1013, Adjusted R-squared: 0.08256
## F-statistic: 5.409 on 1 and 48 DF, p-value: 0.0243
Adding a new predictor and comparing it with the previous model.
mod2 <- lm(avg_hatecrimes_per_100k_fbi ~ median_household_income + gini_index, data = inequality)
anova(mod2) #anova type comparison of the model that allows us to see whether newly added predictor explained additional variation in the outcome. We can see changes in the Sum Sq for residuals and for the main effects
## Analysis of Variance Table
##
## Response: avg_hatecrimes_per_100k_fbi
## Df Sum Sq Mean Sq F value Pr(>F)
## median_household_income 1 14.584 14.584 7.0649 0.0107093 *
## gini_index 1 32.389 32.389 15.6906 0.0002517 ***
## Residuals 47 97.020 2.064
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can also test whether there is need to adjust influence of median household income across the values of gini index - interaction between our predictors
mod3 <- lm(avg_hatecrimes_per_100k_fbi ~ median_household_income * gini_index, data = inequality)
anova(mod1, mod2, mod3) # comparison between three models
## Analysis of Variance Table
##
## Model 1: avg_hatecrimes_per_100k_fbi ~ median_household_income
## Model 2: avg_hatecrimes_per_100k_fbi ~ median_household_income + gini_index
## Model 3: avg_hatecrimes_per_100k_fbi ~ median_household_income * gini_index
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 48 129.41
## 2 47 97.02 1 32.389 19.742 5.539e-05 ***
## 3 46 75.47 1 21.550 13.135 0.0007218 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Summary of the model:
summary(mod3)
##
## Call:
## lm(formula = avg_hatecrimes_per_100k_fbi ~ median_household_income *
## gini_index, data = inequality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.16383 -0.96279 -0.01053 0.88008 2.75763
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.056e+01 3.070e+01 2.950 0.004986 **
## median_household_income -1.724e-03 4.965e-04 -3.472 0.001136 **
## gini_index -2.001e+02 6.666e+01 -3.002 0.004330 **
## median_household_income:gini_index 3.907e-03 1.078e-03 3.624 0.000722 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.281 on 46 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.4759, Adjusted R-squared: 0.4417
## F-statistic: 13.92 on 3 and 46 DF, p-value: 1.367e-06
We can visualise the interactions using interactions package.
require(interactions)
## Loading required package: interactions
## Warning: package 'interactions' was built under R version 4.1.1
interact_plot(mod3, pred = median_household_income, modx = gini_index, plot.points = T)
Interactive visualisation:
p <- ggplot(simulatedD, aes(median_household_income, Avg_hate_pred, color = gini_index,
frame = gini_index)) + geom_point() # Using ggplot to make the plot
plotly::ggplotly(p) #make it interactive using plotly
Model diagnostics
Quantile-quantile plot: Normality
car::qqPlot(mod3)
## [1] 9 18
Homoscedasticity: Linearity
car::residualPlot(mod3, type = "rstudent") # we can call car package without loading it into R environment by calling car::
Outliers:
car::influenceIndexPlot(mod3) #influence of individual observation on our model
Autocorrelation of residuals
par(bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
stats::acf(resid(mod3))
Predicted versus observed data:
par(bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(predict(mod3), mod3$model$avg_hatecrimes_per_100k_fbi, xlab = "Predicted values (model 3)",
ylab = "Observed values (avg hatecrimes)") #The x-axis is predict(mod3) - predict values based on our model. The y-axis is the data that was used to build the model. I decided to take these values from our model frame (mod3), instead of original data frame (inequalities).
Finally, we can subset our model and exclude data point that might skew our results
mod3.cor <- lm(avg_hatecrimes_per_100k_fbi ~ median_household_income * gini_index,
data = inequality, subset = -9) #We subset our data to exclude row 9
summary(mod3.cor)
##
## Call:
## lm(formula = avg_hatecrimes_per_100k_fbi ~ median_household_income *
## gini_index, data = inequality, subset = -9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.90964 -0.94966 -0.08526 0.73470 2.55257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.429e+01 3.080e+01 1.438 0.157
## median_household_income -7.992e-04 5.228e-04 -1.529 0.133
## gini_index -9.668e+01 6.725e+01 -1.438 0.157
## median_household_income:gini_index 1.837e-03 1.145e-03 1.604 0.116
##
## Residual standard error: 1.154 on 45 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1285, Adjusted R-squared: 0.07041
## F-statistic: 2.212 on 3 and 45 DF, p-value: 0.09974