This document outlines the R code that was used across the lectures for Advanced Statistics course at MSc in Psychological Research Methods programme at the University of Sheffield, UK. The code follows the lectures, but also expands on the content covered by the lecture.

Lecture 1: Linear regression

Introduction

First, we used the dataset that are included in R packages. We can see what data do we have in standard R packages by calling: data() or we can see datasets specific for certain R packages by calling: data(package=‘lme4’).

In this case, we can call pre-loaded dataset cars. More information on this data is available by calling ?cars or help(cars).

data("cars")  #calls the dataset
head(cars)  # prints first 6 observations in the dataset
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

We can use scatter plot to see the raw values.

plot(cars$speed, cars$dist, xlab = "Predictor", ylab = "Outcome")  #As this plot was only used to make an illustration of the linear regression, we changed the labels of x and y-axis. This was done using xlab and ylab parameters. You can change the name of the labels using the same code, eg. xlab='Speed (mph)', ylab='Stopping distance (ft)'
abline(lm(dist ~ speed, data = cars), lwd = 2)  #abline function adds one or more straight lines through the plot that you have active in your environment. lm function is used to fit linear models, where dist is modelled as a function of speed. We also indicate that values for distance and speed can be found in the cars dataset (data = cars). Finally, we specify the thickness of abline function with lwd=2 parameter. 

Data simulation

We can also simulate some data:

set.seed(456)  # we can set starting numbers that are used to generate our random values. Random numbers are rarely truly random: https://engineering.mit.edu/engage/ask-an-engineer/can-a-computer-generate-a-truly-random-number/
Babies = data.frame(Age = round(runif(100, 1, 30)), Weight = rnorm(100, 4000, 500))  #We create a new data frame (Babies) where we have Age and Weight as variables. 100 Age values are sampled from random uniform distribution (runif) with lower bound of 1 (minimum) and upper bound of 30 (maximum). 100 Weight values are generated using random normal distribution (rnorm) with mean of 4000 and SD of 500 
Babies$Height = rnorm(100, 40 + 0.2 * Babies$Age + 0.004 * Babies$Weight, 5)  # Height is generated using random normal distribution where mean is a function of Age and Weight, while SD is 5. 
Babies$Gender = factor(rbinom(100, 1, 0.5))  # 100 Gender values are generated using random binomial function with equal probability of being assigned one or the other sex category
levels(Babies$Gender) = c("Girls", "Boys")  #We levels function to assign Girls and Boys labels to 1 and 0 levels generated by the function

We can plot and inspect raw data:

par(mfrow = c(1, 3), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)  # par parameter sets global plotting settings. mfrow indicates number of panels for plots (1 row and 3 columns), bty sets type of box around the plot, mar defines margines around the plot, cex magnifies size of labels, while pch sets type of points used in the plot.
plot(Babies$Age, Babies$Height, xlab = "Age (months)", ylab = "Height (cm)")
plot(Babies$Weight, Babies$Height, xlab = "Weight (grams)", ylab = "Height (cm)")
boxplot(Babies$Height ~ Babies$Gender, xlab = "Gender", ylab = "Height (cm)")

We can also check the coding and formatting of the variables in our data frame.

str(Babies)
## 'data.frame':    100 obs. of  4 variables:
##  $ Age   : num  4 7 22 26 24 11 3 9 8 12 ...
##  $ Weight: num  3668 3872 4339 4448 4309 ...
##  $ Height: num  61.5 56.1 59.3 55.2 60.3 ...
##  $ Gender: Factor w/ 2 levels "Girls","Boys": 1 1 2 2 2 2 1 1 1 1 ...

Finally, we can also load the dataset from our local disc drives. The datasets should be placed in your working environment. There are a number of tutorials that guide you how to optimise creation of the new projects, which results in all of your data, code and files being generated and saved at one place. https://r4ds.had.co.nz/workflow-projects.html

inequality <- read.table("inequality.txt", sep = "\t", header = T)  # read a file in table format and create a data frame from it. sep parameter defines separator in the data - tab delimited format in this case, while it could be also anything else, such as ',' - comma separated values, ' ' - space separated values etc. Header = T defines that the names of the variables are in the first row of the database.

# You can also numerous other extensions, suchs spss sav files, however you
# will often need additional packages for this.  Foreign package -
# library(foreign) has read.spss function that can be used to read in spss
# files dataExample<-read.spss('Example.sav', to.data.frame=T)

Linear regression

One predictor

lm1 <- lm(Height ~ Age, data = Babies)  # linear model where Height is modelled as a function of Age. 
lm1$coefficients  # We can print only the coefficients
## (Intercept)         Age 
##   57.025804    0.143174
summary(lm1$coefficients)  #We can also part of the information that is frequently used to judge significance and importance of predictors  
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1432 14.3638 28.5845 28.5845 42.8051 57.0258

Two predictors

lm2 <- lm(Height ~ Age + Gender, data = Babies)  # linear model where Height is analysed as a function of Age and Gender
lm2$coefficients
## (Intercept)         Age  GenderBoys 
##  57.5038799   0.1403955  -0.8309449

Main effects and interaction (numerical x categorical)

lm3 <- lm(Height ~ Age * Gender, data = Babies)  # linear model where Height is modelled as a function of Age and Gender, as well as their interaction
lm3$coefficients
##    (Intercept)            Age     GenderBoys Age:GenderBoys 
##     56.1448603      0.2202400      1.8315868     -0.1607307

Main effects and interaction (numerical x numerical)

lm4 <- lm(Height ~ Age * Weight, data = Babies)
lm4$coefficients
##   (Intercept)           Age        Weight    Age:Weight 
## 30.7244904854  0.6913148965  0.0066360329 -0.0001397745

Other information that we get from the linear model

lm1 <- lm(Height ~ Age, data = Babies)
summary(lm1)
## 
## Call:
## lm(formula = Height ~ Age, data = Babies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4765  -4.1601  -0.3703   3.9198  12.3842 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 57.02580    1.18751  48.021   <2e-16 ***
## Age          0.14317    0.06426   2.228   0.0282 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.283 on 98 degrees of freedom
## Multiple R-squared:  0.04821,    Adjusted R-squared:  0.0385 
## F-statistic: 4.964 on 1 and 98 DF,  p-value: 0.02817

We can calculate R2 step-by-step. First, we need predictions from the model that includes predictor, as this will give us residual sum of squares.

Babies$lm1 = predict(lm1, newdata = Babies)  # predict Height based on our lm1 model.
Babies$diff = Babies$lm1 - Babies$Height  #calculate differences between predicted (lm1) and observed values (Height) 

We can plot these differences using ggplot.

require(ggplot2)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.1.1
ggplot() + geom_linerange(data = Babies, aes(x = Age, ymin = Height, ymax = lm1,
    colour = diff), size = 1.2) + geom_line(data = Babies, aes(x = Age, y = lm1),
    size = 1.2) + geom_point(data = Babies, aes(Age, y = Height), size = 2) + ylab("Height") +
    xlab("Age") + ggtitle("SS_residual")

A second ingredient for our determination coefficient is sum of squares of total variation in the data. This we can get by building model with intercept only.

lm0 <- lm(Height ~ 1, data = Babies)
summary(lm0)
## 
## Call:
## lm(formula = Height ~ 1, data = Babies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.4165  -4.2284  -0.2062   3.6744  13.5940 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59.3953     0.5388   110.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.388 on 99 degrees of freedom
Babies$lm0 = predict(lm0, newdata = Babies)  #predict Height based on average value only
Babies$diff2 = Babies$lm0 - Babies$Height  #calculate difference between predicted (lm0) and observed values (Height)

We can plot these differences using ggplot.

ggplot() + geom_linerange(data = Babies, aes(x = Age, ymin = Height, ymax = lm0,
    colour = diff2), size = 1.2) + geom_line(data = Babies, aes(x = Age, y = lm0),
    size = 1.2) + geom_point(data = Babies, aes(Age, y = Height), size = 2) + ylab("Height") +
    xlab("Age") + ggtitle("SS_total")

The R2 coefficient is:

1 - (sum(Babies$diff^2)/(sum(Babies$diff2^2)))
## [1] 0.04821098

Improvement in our prediction:

Babies$diff3 = Babies$lm1 - Babies$lm0  #Improvement based on the inclusion of Age as a predictor - differences between the predicted values from (lm1) and intercept only model (lm0)
ggplot() + geom_linerange(data = Babies, aes(x = Age, ymin = lm1, ymax = lm0, colour = diff3),
    size = 1.2) + geom_line(data = Babies, aes(x = Age, y = lm0), size = 1.2) + geom_line(data = Babies,
    aes(Age, y = lm1), size = 1.3, linetype = "dotdash") + geom_point(data = Babies,
    aes(x = Age, y = Height), size = 0.9) + ylab("Height") + xlab("Age") + ggtitle("Improvement") +
    theme(axis.title = element_text(size = 14), axis.text = element_text(size = 12))

F-value:

(sum(Babies$diff3^2)/1)/(sum(Babies$diff^2)/98)
## [1] 4.963995

Practical aspect of the lecture

inequality <- read.table("inequality.txt", sep = ",", header = T)  #Reading the data in R
head(inequality)
##        state median_household_income share_unemployed_seasonal
## 1    Alabama                   42278                     0.060
## 2     Alaska                   67629                     0.064
## 3    Arizona                   49254                     0.063
## 4   Arkansas                   44922                     0.052
## 5 California                   60487                     0.059
## 6   Colorado                   60940                     0.040
##   share_population_in_metro_areas share_population_with_high_school_degree
## 1                            0.64                                    0.821
## 2                            0.63                                    0.914
## 3                            0.90                                    0.842
## 4                            0.69                                    0.824
## 5                            0.97                                    0.806
## 6                            0.80                                    0.893
##   share_non_citizen share_white_poverty gini_index share_non_white
## 1              0.02                0.12      0.472            0.35
## 2              0.04                0.06      0.422            0.42
## 3              0.10                0.09      0.455            0.49
## 4              0.04                0.12      0.458            0.26
## 5              0.13                0.09      0.471            0.61
## 6              0.06                0.07      0.457            0.31
##   share_voters_voted_trump hate_crimes_per_100k_splc
## 1                     0.63                0.12583893
## 2                     0.53                0.14374012
## 3                     0.50                0.22531995
## 4                     0.60                0.06906077
## 5                     0.33                0.25580536
## 6                     0.44                0.39052330
##   avg_hatecrimes_per_100k_fbi
## 1                   1.8064105
## 2                   1.6567001
## 3                   3.4139280
## 4                   0.8692089
## 5                   2.3979859
## 6                   2.8046888

Probability density plots: outcomes

par(mfrow = c(1, 2))
plot(density(inequality$hate_crimes_per_100k_splc, na.rm = T), main = "Crimes per 100k")  #probability density for hate crimes outcomes. na.rm=T indicates that only cases that are not NAs should be returned
plot(density(inequality$avg_hatecrimes_per_100k_fbi, na.rm = T), main = "Average Crimes")

Probability density plots: predictors

par(mfrow = c(1, 2))
plot(density(inequality$median_household_income, na.rm = T), main = "Income")
plot(density(inequality$gini_index, na.rm = T), main = "Gini")

Scatter plots:

par(mfrow = c(1, 2))
plot(inequality$median_household_income, inequality$avg_hatecrimes_per_100k_fbi,
    xlab = "Median household income", ylab = "Avg hatecrimes")
plot(inequality$gini_index, inequality$avg_hatecrimes_per_100k_fbi, xlab = "Gini index",
    ylab = "Avg hatecrimes")

cor(inequality[, c(2, 8, 12)], use = "complete.obs")  #calculate correlations where we use only rows that have all values (no NAs). We are only focusing on columns: 2,8 and 12. inequality is a data frame with (n rows, m columns), so we can access only first row by calling inequality[1,], or just first column by inequality[,1], first row and first column would be inequality[1,1]. When using inequality[,c(2,8,12)] am asking for all rows but only second, eight and twelfth column.
##                             median_household_income gini_index
## median_household_income                   1.0000000 -0.1497444
## gini_index                               -0.1497444  1.0000000
## avg_hatecrimes_per_100k_fbi               0.3182464  0.4212719
##                             avg_hatecrimes_per_100k_fbi
## median_household_income                       0.3182464
## gini_index                                    0.4212719
## avg_hatecrimes_per_100k_fbi                   1.0000000

Stepwise approach in building linear model

mod1 <- lm(avg_hatecrimes_per_100k_fbi ~ median_household_income, data = inequality)  #modelling avg hatecrimes as a function of median_household_income
summary(mod1)
## 
## Call:
## lm(formula = avg_hatecrimes_per_100k_fbi ~ median_household_income, 
##     data = inequality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3300 -1.1183 -0.1656  0.7827  7.7762 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)  
## (Intercept)             -9.564e-01  1.448e+00  -0.661   0.5121  
## median_household_income  6.054e-05  2.603e-05   2.326   0.0243 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.642 on 48 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1013, Adjusted R-squared:  0.08256 
## F-statistic: 5.409 on 1 and 48 DF,  p-value: 0.0243

Adding a new predictor and comparing it with the previous model.

mod2 <- lm(avg_hatecrimes_per_100k_fbi ~ median_household_income + gini_index, data = inequality)
anova(mod2)  #anova type comparison of the model that allows us to see whether newly added predictor explained additional variation in the outcome. We can see changes in the Sum Sq for residuals and for the main effects
## Analysis of Variance Table
## 
## Response: avg_hatecrimes_per_100k_fbi
##                         Df Sum Sq Mean Sq F value    Pr(>F)    
## median_household_income  1 14.584  14.584  7.0649 0.0107093 *  
## gini_index               1 32.389  32.389 15.6906 0.0002517 ***
## Residuals               47 97.020   2.064                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can also test whether there is need to adjust influence of median household income across the values of gini index - interaction between our predictors

mod3 <- lm(avg_hatecrimes_per_100k_fbi ~ median_household_income * gini_index, data = inequality)
anova(mod1, mod2, mod3)  # comparison between three models
## Analysis of Variance Table
## 
## Model 1: avg_hatecrimes_per_100k_fbi ~ median_household_income
## Model 2: avg_hatecrimes_per_100k_fbi ~ median_household_income + gini_index
## Model 3: avg_hatecrimes_per_100k_fbi ~ median_household_income * gini_index
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     48 129.41                                  
## 2     47  97.02  1    32.389 19.742 5.539e-05 ***
## 3     46  75.47  1    21.550 13.135 0.0007218 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary of the model:

summary(mod3)
## 
## Call:
## lm(formula = avg_hatecrimes_per_100k_fbi ~ median_household_income * 
##     gini_index, data = inequality)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.16383 -0.96279 -0.01053  0.88008  2.75763 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         9.056e+01  3.070e+01   2.950 0.004986 ** 
## median_household_income            -1.724e-03  4.965e-04  -3.472 0.001136 ** 
## gini_index                         -2.001e+02  6.666e+01  -3.002 0.004330 ** 
## median_household_income:gini_index  3.907e-03  1.078e-03   3.624 0.000722 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.281 on 46 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.4759, Adjusted R-squared:  0.4417 
## F-statistic: 13.92 on 3 and 46 DF,  p-value: 1.367e-06

We can visualise the interactions using interactions package.

require(interactions)
## Loading required package: interactions
## Warning: package 'interactions' was built under R version 4.1.1
interact_plot(mod3, pred = median_household_income, modx = gini_index, plot.points = T)

Interactive visualisation:

p <- ggplot(simulatedD, aes(median_household_income, Avg_hate_pred, color = gini_index,
    frame = gini_index)) + geom_point()  # Using ggplot to make the plot
plotly::ggplotly(p)  #make it interactive using plotly

Model diagnostics

Quantile-quantile plot: Normality

car::qqPlot(mod3)

## [1]  9 18

Homoscedasticity: Linearity

car::residualPlot(mod3, type = "rstudent")  # we can call car package without loading it into R environment by calling car::

Outliers:

car::influenceIndexPlot(mod3)  #influence of individual observation on our model

Autocorrelation of residuals

par(bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
stats::acf(resid(mod3))

Predicted versus observed data:

par(bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(predict(mod3), mod3$model$avg_hatecrimes_per_100k_fbi, xlab = "Predicted values (model 3)",
    ylab = "Observed values (avg hatecrimes)")  #The x-axis is predict(mod3) - predict values based on our model. The y-axis is the data that was used to build the model. I decided to take these values from our model frame (mod3), instead of original data frame (inequalities). 

Finally, we can subset our model and exclude data point that might skew our results

mod3.cor <- lm(avg_hatecrimes_per_100k_fbi ~ median_household_income * gini_index,
    data = inequality, subset = -9)  #We subset our data to exclude row 9
summary(mod3.cor)
## 
## Call:
## lm(formula = avg_hatecrimes_per_100k_fbi ~ median_household_income * 
##     gini_index, data = inequality, subset = -9)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.90964 -0.94966 -0.08526  0.73470  2.55257 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)                         4.429e+01  3.080e+01   1.438    0.157
## median_household_income            -7.992e-04  5.228e-04  -1.529    0.133
## gini_index                         -9.668e+01  6.725e+01  -1.438    0.157
## median_household_income:gini_index  1.837e-03  1.145e-03   1.604    0.116
## 
## Residual standard error: 1.154 on 45 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1285, Adjusted R-squared:  0.07041 
## F-statistic: 2.212 on 3 and 45 DF,  p-value: 0.09974

Lecture 3: Path model

Introduction

Adding more variables to our data. It just keeps growing:

library(truncnorm)
require(lavaan)
set.seed(456)
Babies = data.frame(Age = round(runif(100, 1, 30)), Weight = rnorm(100, 4000, 500))
Babies$Height = rnorm(100, 40 + 0.2 * Babies$Age + 0.004 * Babies$Weight, 5)
Babies$Sex = rbinom(100, 1, 0.5)
Babies$Nutrition = rtruncnorm(n = 100, a = 0, b = 30, mean = 5, sd = 10)
Babies$PhyExer = rnorm(100, 180, 50)
Babies$GMA = rnorm(100, 180, 50)
Babies$SocialBeh = rnorm(100, 180 + Babies$PhyExer, 80)
Babies$TummySleep = rbinom(100, 1, 0.5)
Babies$CognitiveAb = rnorm(100, 10 + 7 * Babies$Nutrition + 0.1 * Babies$PhyExer +
    3 * Babies$GMA + 0.03 * Babies$PhyExer * Babies$SocialBeh, 5)
Babies$Sex = as.factor(Babies$Sex)
levels(Babies$Sex) = c("Girls", "Boys")

Fitting the model:

# install.packages('lavaan') - we need lavaan package to fit structural
# equation models, both path models and confirmatory factor analysis
require(lavaan)
# model has to be written in a separate object. We have quotation marks ('') to
# indicate that this is a textual input.
modelAbility <- "
SocialBeh~Nutrition+PhyExer+GMA
CognitiveAb~SocialBeh+Nutrition+GMA
"
# we use sem function to fit the model, while we specify that data is in the
# dataset Babies
fit1 <- sem(modelAbility, data = Babies)
## Warning in lav_data_full(data = data, group = group, cluster = cluster, : lavaan WARNING: some observed variances are larger than 1000000
##   lavaan NOTE: use varTable(fit) to investigate
summary(fit1)
## lavaan 0.6-9 ended normally after 46 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         8
##                                                       
##   Number of observations                           100
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                               215.236
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate    Std.Err   z-value  P(>|z|)
##   SocialBeh ~                                            
##     Nutrition           0.030     1.105    0.027    0.978
##     PhyExer             1.281     0.146    8.796    0.000
##     GMA                -0.075     0.151   -0.500    0.617
##   CognitiveAb ~                                          
##     SocialBeh           9.579     0.469   20.428    0.000
##     Nutrition           7.019     6.899    1.017    0.309
##     GMA                 2.461     0.941    2.614    0.009
## 
## Variances:
##                    Estimate    Std.Err   z-value  P(>|z|)
##    .SocialBeh        5515.809   780.053    7.071    0.000
##    .CognitiveAb    215129.015 30423.837    7.071    0.000

We can also plot the estimates of our model:

# install.packages('tidySEM')
require("tidySEM")
graph_sem(fit1, variance_diameter = 0.2)

Indices of the model fit:

summary(fit1, fit.measures = TRUE)  # we need to include fit.measures=TRUE for lavaan to print them out
## lavaan 0.6-9 ended normally after 46 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         8
##                                                       
##   Number of observations                           100
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                               215.236
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                               438.108
##   Degrees of freedom                                 7
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.503
##   Tucker-Lewis Index (TLI)                      -2.479
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -1328.506
##   Loglikelihood unrestricted model (H1)      -1220.888
##                                                       
##   Akaike (AIC)                                2673.012
##   Bayesian (BIC)                              2693.853
##   Sample-size adjusted Bayesian (BIC)         2668.587
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          1.464
##   90 Percent confidence interval - lower         1.303
##   90 Percent confidence interval - upper         1.632
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.080
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate    Std.Err   z-value  P(>|z|)
##   SocialBeh ~                                            
##     Nutrition           0.030     1.105    0.027    0.978
##     PhyExer             1.281     0.146    8.796    0.000
##     GMA                -0.075     0.151   -0.500    0.617
##   CognitiveAb ~                                          
##     SocialBeh           9.579     0.469   20.428    0.000
##     Nutrition           7.019     6.899    1.017    0.309
##     GMA                 2.461     0.941    2.614    0.009
## 
## Variances:
##                    Estimate    Std.Err   z-value  P(>|z|)
##    .SocialBeh        5515.809   780.053    7.071    0.000
##    .CognitiveAb    215129.015 30423.837    7.071    0.000

Model modifications:

modelAbility2 <- "
SocialBeh~Nutrition+PhyExer+GMA
CognitiveAb~SocialBeh+Nutrition+GMA+PhyExer
"
fit2 <- sem(modelAbility2, data = Babies)
summary(fit2, fit.measures = TRUE)
## lavaan 0.6-9 ended normally after 56 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         9
##                                                       
##   Number of observations                           100
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Model Test Baseline Model:
## 
##   Test statistic                               438.108
##   Degrees of freedom                                 7
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.000
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -1220.888
##   Loglikelihood unrestricted model (H1)      -1220.888
##                                                       
##   Akaike (AIC)                                2459.776
##   Bayesian (BIC)                              2483.222
##   Sample-size adjusted Bayesian (BIC)         2454.798
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent confidence interval - lower         0.000
##   90 Percent confidence interval - upper         0.000
##   P-value RMSEA <= 0.05                             NA
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate   Std.Err  z-value  P(>|z|)
##   SocialBeh ~                                          
##     Nutrition          0.030    1.105    0.027    0.978
##     PhyExer            1.281    0.146    8.796    0.000
##     GMA               -0.075    0.151   -0.500    0.617
##   CognitiveAb ~                                        
##     SocialBeh          5.701    0.213   26.781    0.000
##     Nutrition          8.548    2.352    3.634    0.000
##     GMA                2.814    0.321    8.764    0.000
##     PhyExer           11.390    0.413   27.577    0.000
## 
## Variances:
##                    Estimate   Std.Err  z-value  P(>|z|)
##    .SocialBeh       5515.809  780.053    7.071    0.000
##    .CognitiveAb    24999.990 3535.532    7.071    0.000

Model comparison:

lavTestLRT(fit1, fit2)  # we are comparing fit of the first and the second model
## Chi-Squared Difference Test
## 
##      Df    AIC    BIC  Chisq Chisq diff Df diff Pr(>Chisq)    
## fit2  0 2459.8 2483.2   0.00                                  
## fit1  1 2673.0 2693.8 215.24     215.24       1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Check the modification indices:

modindices(fit1, sort = TRUE)  #we are checking them for the first model, high values indicate potential pathways that we could include and improve the model fit
##            lhs op         rhs     mi    epc sepc.lv sepc.all sepc.nox
## 17 CognitiveAb  ~     PhyExer 88.379 11.390  11.390    0.553    0.011
## 16   SocialBeh  ~ CognitiveAb 88.379 -0.228  -0.228   -2.420   -2.420
## 23     PhyExer  ~ CognitiveAb 82.143  0.128   0.128    2.635    2.635
## 27         GMA  ~ CognitiveAb  1.601  0.025   0.025    0.529    0.529
## 19   Nutrition  ~ CognitiveAb  1.002  0.007   0.007    1.114    1.114
## 18   Nutrition  ~   SocialBeh  0.000  0.000   0.000    0.000    0.000
## 22     PhyExer  ~   SocialBeh  0.000  0.000   0.000    0.000    0.000
## 26         GMA  ~   SocialBeh  0.000  0.000   0.000    0.000    0.000
## 21   Nutrition  ~         GMA  0.000  0.000   0.000    0.000    0.000
## 25     PhyExer  ~         GMA  0.000  0.000   0.000    0.000    0.000
## 13     PhyExer ~~         GMA  0.000  0.000   0.000       NA    0.000
## 29         GMA  ~     PhyExer  0.000  0.000   0.000    0.000    0.000
## 10   Nutrition ~~     PhyExer  0.000  0.000   0.000       NA    0.000
## 24     PhyExer  ~   Nutrition  0.000  0.000   0.000    0.000    0.000
## 20   Nutrition  ~     PhyExer  0.000  0.000   0.000    0.000    0.000

Calculation of the indirect pathways:

modelAbilityPath <- "
SocialBeh~Nutrition+a*PhyExer+GMA
CognitiveAb~b*SocialBeh+c*Nutrition+GMA

indirect := a*b
direct := c
total := indirect + direct
"
# multiplication sign (*) labels the parameter - assigns regression coefficient
# (PhyExer) to value a.  using := we can specify calculations in the sem. You
# can do any other type of calculaton also, eg. a^2

fitPath <- sem(modelAbilityPath, data = Babies)
summary(fitPath)
## lavaan 0.6-9 ended normally after 46 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         8
##                                                       
##   Number of observations                           100
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                               215.236
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate    Std.Err   z-value  P(>|z|)
##   SocialBeh ~                                            
##     Nutrition           0.030     1.105    0.027    0.978
##     PhyExer    (a)      1.281     0.146    8.796    0.000
##     GMA                -0.075     0.151   -0.500    0.617
##   CognitiveAb ~                                          
##     SocialBeh  (b)      9.579     0.469   20.428    0.000
##     Nutrition  (c)      7.019     6.899    1.017    0.309
##     GMA                 2.461     0.941    2.614    0.009
## 
## Variances:
##                    Estimate    Std.Err   z-value  P(>|z|)
##    .SocialBeh        5515.809   780.053    7.071    0.000
##    .CognitiveAb    215129.015 30423.837    7.071    0.000
## 
## Defined Parameters:
##                    Estimate    Std.Err   z-value  P(>|z|)
##     indirect           12.275     1.519    8.079    0.000
##     direct              7.019     6.899    1.017    0.309
##     total              19.294     7.074    2.727    0.006

PiecewiseSEM package

# install.packages('piecewiseSEM)
require(piecewiseSEM)
model1 <- psem(lm(SocialBeh ~ Nutrition + PhyExer + GMA, data = Babies), lm(CognitiveAb ~
    SocialBeh + Nutrition + GMA, data = Babies))  # combined linear regression functions
summary(model1, .progressBar = FALSE)
## 
## Structural Equation Model of model1 
## 
## Call:
##   SocialBeh ~ Nutrition + PhyExer + GMA
##   CognitiveAb ~ SocialBeh + Nutrition + GMA
## 
##     AIC      BIC
##  229.364   255.416
## 
## ---
## Tests of directed separation:
## 
##                Independ.Claim Test.Type DF Crit.Value P.Value    
##   CognitiveAb ~ PhyExer + ...      coef 95    26.8792       0 ***
## 
## Global goodness-of-fit:
## 
##   Fisher's C = 209.364 with P-value = 0 and on 2 degrees of freedom
## 
## ---
## Coefficients:
## 
##      Response Predictor Estimate Std.Error DF Crit.Value P.Value Std.Estimate
##     SocialBeh Nutrition   0.0300    1.1278 96     0.0266  0.9789       0.0020
##     SocialBeh   PhyExer   1.2814    0.1487 96     8.6187  0.0000       0.6604
##     SocialBeh       GMA  -0.0753    0.1538 96    -0.4899  0.6253      -0.0375
##   CognitiveAb SocialBeh   9.5792    0.4786 96    20.0156  0.0000       0.9023
##   CognitiveAb Nutrition   7.0193    7.0413 96     0.9969  0.3213       0.0447
##   CognitiveAb       GMA   2.4607    0.9607 96     2.5614  0.0120       0.1155
##      
##      
##   ***
##      
##   ***
##      
##     *
## 
##   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05
## 
## ---
## Individual R-squared:
## 
##      Response method R.squared
##     SocialBeh   none      0.44
##   CognitiveAb   none      0.81

Practical aspect of the lecture

Getting the data

NBAPath <- read.table("NBApath.txt", sep = "\t", header = T)

Summary of the NBA data

summary(NBAPath)
##      TEAM                PCT            Player              Pos           
##  Length:3810        Min.   :0.1061   Length:3810        Length:3810       
##  Class :character   1st Qu.:0.3780   Class :character   Class :character  
##  Mode  :character   Median :0.5000   Mode  :character   Mode  :character  
##                     Mean   :0.4905                                        
##                     3rd Qu.:0.6098                                        
##                     Max.   :0.8902                                        
##       Age              GP             PER        
##  Min.   :18.00   Min.   : 1.00   Min.   :-13.10  
##  1st Qu.:23.00   1st Qu.:34.00   1st Qu.: 10.00  
##  Median :25.00   Median :61.00   Median : 12.80  
##  Mean   :26.05   Mean   :53.73   Mean   : 12.75  
##  3rd Qu.:29.00   3rd Qu.:77.00   3rd Qu.: 15.80  
##  Max.   :43.00   Max.   :82.00   Max.   : 35.20

Correlation matrix

cor(NBAPath[, c(2, 5:7)])  #correlation between second, fifth, and seventh variable in the NBAPath dataframe
##            PCT        Age         GP        PER
## PCT 1.00000000 0.14304325 0.08849459 0.07720633
## Age 0.14304325 1.00000000 0.05170204 0.03598025
## GP  0.08849459 0.05170204 1.00000000 0.45360129
## PER 0.07720633 0.03598025 0.45360129 1.00000000

Univariate plots

par(mfrow = c(1, 2), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(density(NBAPath$PER), main = "")
plot(density(NBAPath$PCT), main = "")

Bivariate plots

par(mfrow = c(1, 2), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(NBAPath$Age, NBAPath$PER)
plot(NBAPath$GP, NBAPath$PER)

Estimating the model

NBAmod1 <- "
GP~b*Age
PER~a*Age+c*GP

dir := a
ind := b*c
tot := dir + ind
"
NBAfit1 <- sem(NBAmod1, data = NBAPath)
summary(NBAfit1)
## lavaan 0.6-9 ended normally after 21 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
##                                                       
##   Number of observations                          3810
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   GP ~                                                
##     Age        (b)    0.315    0.098    3.196    0.001
##   PER ~                                               
##     Age        (a)    0.016    0.018    0.869    0.385
##     GP         (c)    0.093    0.003   31.333    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .GP              645.883   14.798   43.646    0.000
##    .PER              21.834    0.500   43.646    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     dir               0.016    0.018    0.869    0.385
##     ind               0.029    0.009    3.179    0.001
##     tot               0.045    0.020    2.222    0.026

R2 and other measures of fit

inspect(NBAfit1, "r2")
##    GP   PER 
## 0.003 0.206
-2 * logLik(NBAfit1)  # deviance 
## 'log Lik.' 58025.67 (df=5)
AIC(NBAfit1)  # Akaike information criterion
## [1] 58035.67

Respecification of the model

NBAmod2 <- "
GP~b*Age
PER~c*GP

ind := b*c
"
NBAfit2 <- sem(NBAmod2, data = NBAPath)
summary(NBAfit2, fit.measures = T)
## lavaan 0.6-9 ended normally after 21 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         4
##                                                       
##   Number of observations                          3810
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                 0.755
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.385
## 
## Model Test Baseline Model:
## 
##   Test statistic                               888.633
##   Degrees of freedom                                 3
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.001
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)             -29013.211
##   Loglikelihood unrestricted model (H1)     -29012.833
##                                                       
##   Akaike (AIC)                               58034.422
##   Bayesian (BIC)                             58059.403
##   Sample-size adjusted Bayesian (BIC)        58046.693
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent confidence interval - lower         0.000
##   90 Percent confidence interval - upper         0.041
##   P-value RMSEA <= 0.05                          0.987
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.005
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   GP ~                                                
##     Age        (b)    0.315    0.098    3.196    0.001
##   PER ~                                               
##     GP         (c)    0.093    0.003   31.417    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .GP              645.883   14.798   43.646    0.000
##    .PER              21.838    0.500   43.646    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     ind               0.029    0.009    3.179    0.001

Model comparison

# install.packages('semTools')
require(semTools)
diff <- compareFit(NBAfit1, NBAfit2)
summary(diff)
## ################### Nested Model Comparison #########################
## Chi-Squared Difference Test
## 
##         Df   AIC   BIC Chisq Chisq diff Df diff Pr(>Chisq)
## NBAfit1  0 58036 58067 0.000                              
## NBAfit2  1 58034 58059 0.755      0.755       1     0.3849
## 
## ####################### Model Fit Indices ###########################
##         chisq df pvalue rmsea    cfi    tli  srmr        aic        bic
## NBAfit1 .000†        NA .000† 1.000† 1.000  .000† 58035.667  58066.894 
## NBAfit2 .755   1   .385 .000† 1.000† 1.001† .005  58034.422† 58059.403†
## 
## ################## Differences in Fit Indices #######################
##                   df rmsea cfi   tli  srmr    aic   bic
## NBAfit2 - NBAfit1  1     0   0 0.001 0.005 -1.245 -7.49

Respecification of the model 2

NBAmod3 <- "
GP~b*Age
PER~a*Age+c*GP
PCT~d*PER
ind1 := b*c*d
ind2 := a*d
tot := ind1 + ind2
"
NBAfit3 <- sem(NBAmod3, data = NBAPath)
summary(NBAfit3, fit.measures = T)
## lavaan 0.6-9 ended normally after 30 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         7
##                                                       
##   Number of observations                          3810
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                87.884
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                               999.296
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.914
##   Tucker-Lewis Index (TLI)                       0.741
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)             -27272.876
##   Loglikelihood unrestricted model (H1)     -27228.934
##                                                       
##   Akaike (AIC)                               54559.752
##   Bayesian (BIC)                             54603.469
##   Sample-size adjusted Bayesian (BIC)        54581.227
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.106
##   90 Percent confidence interval - lower         0.088
##   90 Percent confidence interval - upper         0.126
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.047
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   GP ~                                                
##     Age        (b)    0.315    0.098    3.196    0.001
##   PER ~                                               
##     Age        (a)    0.016    0.018    0.869    0.385
##     GP         (c)    0.093    0.003   31.333    0.000
##   PCT ~                                               
##     PER        (d)    0.002    0.000    4.780    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .GP              645.883   14.798   43.646    0.000
##    .PER              21.834    0.500   43.646    0.000
##    .PCT               0.023    0.001   43.646    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     ind1              0.000    0.000    2.647    0.008
##     ind2              0.000    0.000    0.855    0.393
##     tot               0.000    0.000    2.015    0.044

Parameter estimates

parameterestimates(NBAfit3, boot.ci.type = "bca.simple", standardized = T)
##     lhs op       rhs label     est     se      z pvalue ci.lower ci.upper
## 1    GP  ~       Age     b   0.315  0.098  3.196  0.001    0.122    0.507
## 2   PER  ~       Age     a   0.016  0.018  0.869  0.385   -0.020    0.051
## 3   PER  ~        GP     c   0.093  0.003 31.333  0.000    0.087    0.099
## 4   PCT  ~       PER     d   0.002  0.000  4.780  0.000    0.001    0.003
## 5    GP ~~        GP       645.883 14.798 43.646  0.000  616.879  674.887
## 6   PER ~~       PER        21.834  0.500 43.646  0.000   20.853   22.814
## 7   PCT ~~       PCT         0.023  0.001 43.646  0.000    0.022    0.025
## 8   Age ~~       Age        17.498  0.000     NA     NA   17.498   17.498
## 9  ind1 :=     b*c*d  ind1   0.000  0.000  2.647  0.008    0.000    0.000
## 10 ind2 :=       a*d  ind2   0.000  0.000  0.855  0.393    0.000    0.000
## 11  tot := ind1+ind2   tot   0.000  0.000  2.015  0.044    0.000    0.000
##     std.lv std.all std.nox
## 1    0.315   0.052   0.012
## 2    0.016   0.013   0.003
## 3    0.093   0.453   0.453
## 4    0.002   0.077   0.077
## 5  645.883   0.997   0.997
## 6   21.834   0.794   0.794
## 7    0.023   0.994   0.994
## 8   17.498   1.000  17.498
## 9    0.000   0.002   0.000
## 10   0.000   0.001   0.000
## 11   0.000   0.003   0.001
# adding bca.simple we are getting bootstrapped intervals using adjusted
# bootstrap percentile method

Bootstrapping our model

# install.packages('bootstrap')
require(bootstrap)
boot <- bootstrapLavaan(NBAfit3, R = 1000)
summary(boot)
##        b                  a                  c                 d            
##  Min.   :0.001798   Min.   :-0.04508   Min.   :0.08161   Min.   :0.0009244  
##  1st Qu.:0.248816   1st Qu.: 0.00456   1st Qu.:0.09033   1st Qu.:0.0019244  
##  Median :0.316568   Median : 0.01673   Median :0.09315   Median :0.0022315  
##  Mean   :0.315661   Mean   : 0.01603   Mean   :0.09319   Mean   :0.0022347  
##  3rd Qu.:0.382383   3rd Qu.: 0.02691   3rd Qu.:0.09579   3rd Qu.:0.0025295  
##  Max.   :0.715760   Max.   : 0.07098   Max.   :0.11068   Max.   :0.0037406  
##      GP~~GP         PER~~PER        PCT~~PCT      
##  Min.   :618.1   Min.   :19.56   Min.   :0.02219  
##  1st Qu.:638.5   1st Qu.:21.29   1st Qu.:0.02320  
##  Median :645.8   Median :21.84   Median :0.02350  
##  Mean   :645.7   Mean   :21.85   Mean   :0.02351  
##  3rd Qu.:653.0   3rd Qu.:22.39   3rd Qu.:0.02379  
##  Max.   :678.4   Max.   :25.29   Max.   :0.02488
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] bootstrap_2019.6   semTools_0.5-5     piecewiseSEM_2.1.2 tidySEM_0.1.10    
## [5] lavaan_0.6-9       truncnorm_1.0-8    interactions_1.1.5 ggplot2_3.3.5     
## 
## loaded via a namespace (and not attached):
##   [1] TH.data_1.1-0         minqa_1.2.4           colorspace_2.0-2     
##   [4] ellipsis_0.3.2        rio_0.5.27            estimability_1.3     
##   [7] rstudioapi_0.13       farver_2.1.0          fansi_0.5.0          
##  [10] mvtnorm_1.1-2         codetools_0.2-18      splines_4.1.0        
##  [13] mnormt_2.0.2          knitr_1.37            texreg_1.37.5        
##  [16] jsonlite_1.7.2        nloptr_1.2.2.2        DiagrammeR_1.0.6.1   
##  [19] compiler_4.1.0        httr_1.4.2            emmeans_1.7.0        
##  [22] backports_1.2.1       assertthat_0.2.1      Matrix_1.3-3         
##  [25] lazyeval_0.2.2        cli_3.1.0             formatR_1.11         
##  [28] visNetwork_2.1.0      htmltools_0.5.1.1     tools_4.1.0          
##  [31] igraph_1.2.6          OpenMx_2.19.8         coda_0.19-4          
##  [34] gtable_0.3.0          glue_1.4.2            dplyr_1.0.7          
##  [37] Rcpp_1.0.7            carData_3.0-4         cellranger_1.1.0     
##  [40] jquerylib_0.1.4       vctrs_0.3.8           nlme_3.1-152         
##  [43] crosstalk_1.1.1       psych_2.1.9           xfun_0.29            
##  [46] stringr_1.4.0         proto_1.0.0           openxlsx_4.2.4       
##  [49] lme4_1.1-27.1         lifecycle_1.0.1       MASS_7.3-54          
##  [52] zoo_1.8-9             scales_1.1.1          hms_1.1.0            
##  [55] parallel_4.1.0        sandwich_3.0-1        RColorBrewer_1.1-2   
##  [58] yaml_2.2.1            curl_4.3.2            pander_0.6.4         
##  [61] sass_0.4.0            stringi_1.7.5         highr_0.9            
##  [64] fastDummies_1.6.3     checkmate_2.0.0       boot_1.3-28          
##  [67] zip_2.2.0             rlang_0.4.11          pkgconfig_2.0.3      
##  [70] evaluate_0.14         lattice_0.20-44       purrr_0.3.4          
##  [73] htmlwidgets_1.5.3     labeling_0.4.2        tidyselect_1.1.1     
##  [76] plyr_1.8.6            magrittr_2.0.1        R6_2.5.1             
##  [79] generics_0.1.0        multcomp_1.4-17       DBI_1.1.1            
##  [82] gsubfn_0.7            pillar_1.6.4          haven_2.4.1          
##  [85] foreign_0.8-81        withr_2.4.2           jtools_2.1.4         
##  [88] survival_3.2-11       abind_1.4-5           tibble_3.1.5         
##  [91] crayon_1.4.2          car_3.0-11            utf8_1.2.2           
##  [94] tmvnsim_1.0-2         plotly_4.10.0         MplusAutomation_1.0.0
##  [97] rmarkdown_2.11        grid_4.1.0            readxl_1.3.1         
## [100] data.table_1.14.0     pbivnorm_0.6.0        forcats_0.5.1        
## [103] digest_0.6.27         xtable_1.8-4          tidyr_1.1.3          
## [106] RcppParallel_5.1.4    stats4_4.1.0          munsell_0.5.0        
## [109] viridisLite_0.4.0     bslib_0.2.5.1

Lecture 4: Confirmatory factor analysis

Theoretical part

Data simulation

set.seed(456)
Babies = data.frame(Age = round(runif(100, 1, 30)), Weight = rnorm(100, 4000, 500))
Babies$Height = rnorm(100, 40 + 0.2 * Babies$Age + 0.004 * Babies$Weight, 5)
Babies$Gender = rbinom(100, 1, 0.5)
Babies$Crawl = rbinom(100, 1, 0.031 * Babies$Age + 1e-05 * Babies$Weight - 0.06 *
    Babies$Gender)
Babies$TummySleep = rbinom(100, 1, 0.5)
Babies$PhysicalSt = rnorm(100, 10 + 0.3 * Babies$Height + 0.1 * Babies$Age - 0.06 *
    Babies$Gender + 0.15 * Babies$TummySleep, 5)
Babies$Gender = as.factor(Babies$Gender)
levels(Babies$Gender) = c("Girls", "Boys")
# install.packages('faux')
require(faux)
## Loading required package: faux
## 
## ************
## Welcome to faux. For support and examples visit:
## https://debruine.github.io/faux/
## - Get and set global package options with: faux_options()
## ************
set.seed(456)  #seed specification for the data simulation

# Here I am specifying a correlation matrix for 6 variables. The correlation
# matrix has 6*6 = 36 values and codes all correlations for
# variable-by-variable relations. The first value in the first row is
# correlation of var1 with itself, then we go to var1-var2, var1-var3,...
# Second row starts with the var2-var1, then var2-var2...

cmat <- c(1, 0.4, 0.4, 0.1, 0.1, 0.1, 0.4, 1, 0.3, 0.1, 0.1, 0.1, 0.4, 0.2, 1, 0.1,
    0.1, 0.1, 0.1, 0.1, 0.1, 1, 0.4, 0.4, 0.1, 0.1, 0.1, 0.4, 1, 0.2, 0.1, 0.1, 0.1,
    0.4, 0.2, 1)

vars <- rnorm_multi(n = 100, 6, 30, 5, cmat)  # now we take that correlation matrix and simulate 6 variables with 100 values with mean of 30 and sd of 5. 

names(vars) = c("TimeOnTummy", "PreciseLegMoves", "PreciseHandMoves", "Babbling",
    "Screeching", "VocalImitation")  #Naming of the columns in the vars dataset. 

Babies = cbind(Babies, vars)  #combination of Babies data set with new variables. Cbind (Column bind) function adds new columns to Babies dataset. 

Covariance matrix

Matrix <- cov(vars)
Matrix[upper.tri(Matrix)] <- NA
knitr::kable(Matrix, format = "html")
TimeOnTummy PreciseLegMoves PreciseHandMoves Babbling Screeching VocalImitation
TimeOnTummy 24.8022903 NA NA NA NA NA
PreciseLegMoves 8.8224852 22.819361 NA NA NA NA
PreciseHandMoves 10.8533001 9.267157 24.266277 NA NA NA
Babbling 1.3525042 4.039860 3.519092 24.687576 NA NA
Screeching 0.5893004 3.116331 1.720064 8.081221 21.159437 NA
VocalImitation 2.7149160 4.457381 4.325576 11.314745 4.809383 25.01231

Model building

# install.packages('lavaan')
require(lavaan)  #package for our cfa function
model1 <- "
motor =~ TimeOnTummy + PreciseLegMoves + PreciseHandMoves
verbal =~ Babbling + Screeching + VocalImitation
"
# our motor LV is regressed onto TimeOnTummy, PreciseLegMoves and
# PreciseHandMoves, while our verbal LV is regressed onto Babbling, Screeching
# and VocalImitation. We used reflective coding =~ for LVs indicating that we
# assume that our LV is causing/influencing behaviour measured in our dataset.

fit1 <- cfa(model1, data = Babies)  #fitting the model
summary(fit1)  #summary of the results
## lavaan 0.6-9 ended normally after 74 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        13
##                                                       
##   Number of observations                           100
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                 3.376
##   Degrees of freedom                                 8
##   P-value (Chi-square)                           0.909
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor =~                                            
##     TimeOnTummy       1.000                           
##     PreciseLegMovs    0.910    0.240    3.791    0.000
##     PreciseHandMvs    1.099    0.293    3.746    0.000
##   verbal =~                                           
##     Babbling          1.000                           
##     Screeching        0.494    0.182    2.718    0.007
##     VocalImitation    0.716    0.246    2.906    0.004
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor ~~                                            
##     verbal            3.433    1.901    1.806    0.071
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .TimeOnTummy      15.031    3.181    4.725    0.000
##    .PreciseLegMovs   14.709    2.867    5.130    0.000
##    .PreciseHandMvs   12.515    3.338    3.749    0.000
##    .Babbling          8.805    5.149    1.710    0.087
##    .Screeching       17.139    2.748    6.236    0.000
##    .VocalImitation   16.742    3.514    4.764    0.000
##     motor             9.523    3.625    2.627    0.009
##     verbal           15.635    5.947    2.629    0.009

Summary with standardised values

summary(fit1, standardized = TRUE)  #with standardised values 
## lavaan 0.6-9 ended normally after 74 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        13
##                                                       
##   Number of observations                           100
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                 3.376
##   Degrees of freedom                                 8
##   P-value (Chi-square)                           0.909
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   motor =~                                                              
##     TimeOnTummy       1.000                               3.086    0.623
##     PreciseLegMovs    0.910    0.240    3.791    0.000    2.807    0.591
##     PreciseHandMvs    1.099    0.293    3.746    0.000    3.392    0.692
##   verbal =~                                                             
##     Babbling          1.000                               3.954    0.800
##     Screeching        0.494    0.182    2.718    0.007    1.952    0.426
##     VocalImitation    0.716    0.246    2.906    0.004    2.832    0.569
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   motor ~~                                                              
##     verbal            3.433    1.901    1.806    0.071    0.281    0.281
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .TimeOnTummy      15.031    3.181    4.725    0.000   15.031    0.612
##    .PreciseLegMovs   14.709    2.867    5.130    0.000   14.709    0.651
##    .PreciseHandMvs   12.515    3.338    3.749    0.000   12.515    0.521
##    .Babbling          8.805    5.149    1.710    0.087    8.805    0.360
##    .Screeching       17.139    2.748    6.236    0.000   17.139    0.818
##    .VocalImitation   16.742    3.514    4.764    0.000   16.742    0.676
##     motor             9.523    3.625    2.627    0.009    1.000    1.000
##     verbal           15.635    5.947    2.629    0.009    1.000    1.000

Model 2: scaling the factors by seting variance to 1:

model2 <- "
motor =~ NA*TimeOnTummy + PreciseLegMoves + PreciseHandMoves
verbal =~ NA*Babbling + Screeching + VocalImitation
motor ~~ 1*motor
verbal ~~ 1*verbal
"
# In this situation we used same specification of the latent factors, but we
# change the way how the scale of latent variables is defined. We are defining
# the scale by seting variance of LVs to 1, this is done with motor ~~ 1*motor
# and verbal ~~ 1*verbal. We also include NA* to two first variables (that are
# always used to scale LVs) that we would like to estimate.

fit2 <- cfa(model2, data = Babies)
summary(fit2, standardized = TRUE)
## lavaan 0.6-9 ended normally after 33 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        13
##                                                       
##   Number of observations                           100
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                 3.376
##   Degrees of freedom                                 8
##   P-value (Chi-square)                           0.909
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   motor =~                                                              
##     TimeOnTummy       3.086    0.587    5.254    0.000    3.086    0.623
##     PreciseLegMovs    2.807    0.557    5.042    0.000    2.807    0.591
##     PreciseHandMvs    3.392    0.597    5.680    0.000    3.392    0.692
##   verbal =~                                                             
##     Babbling          3.954    0.752    5.259    0.000    3.954    0.800
##     Screeching        1.952    0.548    3.560    0.000    1.952    0.426
##     VocalImitation    2.832    0.646    4.381    0.000    2.832    0.569
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   motor ~~                                                              
##     verbal            0.281    0.139    2.028    0.043    0.281    0.281
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##     motor             1.000                               1.000    1.000
##     verbal            1.000                               1.000    1.000
##    .TimeOnTummy      15.031    3.181    4.725    0.000   15.031    0.612
##    .PreciseLegMovs   14.709    2.867    5.130    0.000   14.709    0.651
##    .PreciseHandMvs   12.515    3.338    3.749    0.000   12.515    0.521
##    .Babbling          8.805    5.149    1.710    0.087    8.805    0.360
##    .Screeching       17.139    2.748    6.236    0.000   17.139    0.818
##    .VocalImitation   16.742    3.514    4.764    0.000   16.742    0.676

Model 3: Scaling LVs by effect coding

model3 <- "
motor =~ NA*TimeOnTummy+a*TimeOnTummy + b*PreciseLegMoves + c*PreciseHandMoves
verbal =~ NA*Babbling+a1*Babbling + b1*Screeching + c1*VocalImitation
a+b+c==3
a1+b1+c1==3
"

# In this situation we used same specification of the latent factors, but we
# change the way how the scale of latent variables is defined. We are defining
# the scale by constraining the loadings of manifest variables to latent
# variable to be the exactly as number of manifest variables loaded onto factor
# (3 manifest variable == 3). We need to specify that we would like for model
# to estimate the loading of the first manifest variable: NA*TimeOnTummy, but
# also label coefficients with a, b, c, and finally make a constraint that
# a+b+c == 3

fit3 <- cfa(model3, data = Babies)
summary(fit3, standardized = TRUE)
## lavaan 0.6-9 ended normally after 58 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        15
##   Number of equality constraints                     2
##                                                       
##   Number of observations                           100
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                 3.376
##   Degrees of freedom                                 8
##   P-value (Chi-square)                           0.909
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   motor =~                                                              
##     TimOnTmmy  (a)    0.997    0.153    6.524    0.000    3.086    0.623
##     PrcsLgMvs  (b)    0.907    0.146    6.227    0.000    2.807    0.591
##     PrcsHndMv  (c)    1.096    0.162    6.770    0.000    3.392    0.692
##   verbal =~                                                             
##     Babbling  (a1)    1.358    0.236    5.753    0.000    3.954    0.800
##     Screechng (b1)    0.670    0.158    4.236    0.000    1.952    0.426
##     VoclImttn (c1)    0.972    0.189    5.140    0.000    2.832    0.569
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   motor ~~                                                              
##     verbal            2.536    1.341    1.891    0.059    0.281    0.281
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .TimeOnTummy      15.031    3.181    4.725    0.000   15.031    0.612
##    .PreciseLegMovs   14.709    2.867    5.130    0.000   14.709    0.651
##    .PreciseHandMvs   12.515    3.338    3.749    0.000   12.515    0.521
##    .Babbling          8.805    5.149    1.710    0.087    8.805    0.360
##    .Screeching       17.139    2.748    6.236    0.000   17.139    0.818
##    .VocalImitation   16.742    3.514    4.764    0.000   16.742    0.676
##     motor             9.581    2.070    4.628    0.000    1.000    1.000
##     verbal            8.483    1.896    4.475    0.000    1.000    1.000
## 
## Constraints:
##                                                |Slack|
##     a+b+c - (3)                                  0.000
##     a1+b1+c1 - (3)                               0.000

Adding intercepts to our model:

model3 <- "
motor =~ TimeOnTummy + PreciseLegMoves + PreciseHandMoves
verbal =~ Babbling + Screeching + VocalImitation
TimeOnTummy ~ 1
PreciseLegMoves ~ 1
PreciseHandMoves ~ 1
Babbling ~ 1
Screeching ~ 1 
VocalImitation ~ 1"
# Model stays again identical to the first one (scaled by the measure of first
# variables), however we add TimeOnTummy ~ 1 and this is identical for all
# measured variables.
fit3 <- cfa(model3, data = Babies)
summary(fit3, standardized = TRUE, fit.measures = T)
## lavaan 0.6-9 ended normally after 74 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        19
##                                                       
##   Number of observations                           100
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                 3.376
##   Degrees of freedom                                 8
##   P-value (Chi-square)                           0.909
## 
## Model Test Baseline Model:
## 
##   Test statistic                                88.357
##   Degrees of freedom                                15
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.118
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -1756.127
##   Loglikelihood unrestricted model (H1)      -1754.439
##                                                       
##   Akaike (AIC)                                3550.253
##   Bayesian (BIC)                              3599.751
##   Sample-size adjusted Bayesian (BIC)         3539.744
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent confidence interval - lower         0.000
##   90 Percent confidence interval - upper         0.047
##   P-value RMSEA <= 0.05                          0.954
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.034
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   motor =~                                                              
##     TimeOnTummy       1.000                               3.086    0.623
##     PreciseLegMovs    0.910    0.240    3.791    0.000    2.807    0.591
##     PreciseHandMvs    1.099    0.293    3.746    0.000    3.392    0.692
##   verbal =~                                                             
##     Babbling          1.000                               3.954    0.800
##     Screeching        0.494    0.182    2.718    0.007    1.952    0.426
##     VocalImitation    0.716    0.246    2.906    0.004    2.832    0.569
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   motor ~~                                                              
##     verbal            3.433    1.901    1.806    0.071    0.281    0.281
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .TimeOnTummy      31.423    0.496   63.415    0.000   31.423    6.341
##    .PreciseLegMovs   30.600    0.475   64.380    0.000   30.600    6.438
##    .PreciseHandMvs   29.799    0.490   60.798    0.000   29.799    6.080
##    .Babbling         29.946    0.494   60.573    0.000   29.946    6.057
##    .Screeching       29.785    0.458   65.078    0.000   29.785    6.508
##    .VocalImitation   30.409    0.498   61.110    0.000   30.409    6.111
##     motor             0.000                               0.000    0.000
##     verbal            0.000                               0.000    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .TimeOnTummy      15.031    3.181    4.725    0.000   15.031    0.612
##    .PreciseLegMovs   14.709    2.867    5.130    0.000   14.709    0.651
##    .PreciseHandMvs   12.515    3.338    3.749    0.000   12.515    0.521
##    .Babbling          8.805    5.149    1.710    0.087    8.805    0.360
##    .Screeching       17.139    2.748    6.236    0.000   17.139    0.818
##    .VocalImitation   16.742    3.514    4.764    0.000   16.742    0.676
##     motor             9.523    3.625    2.627    0.009    1.000    1.000
##     verbal           15.635    5.947    2.629    0.009    1.000    1.000

Indices of global model fit

summary(fit1, fit.measures = TRUE)
## lavaan 0.6-9 ended normally after 74 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        13
##                                                       
##   Number of observations                           100
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                 3.376
##   Degrees of freedom                                 8
##   P-value (Chi-square)                           0.909
## 
## Model Test Baseline Model:
## 
##   Test statistic                                88.357
##   Degrees of freedom                                15
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    1.000
##   Tucker-Lewis Index (TLI)                       1.118
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -1756.127
##   Loglikelihood unrestricted model (H1)      -1754.439
##                                                       
##   Akaike (AIC)                                3538.253
##   Bayesian (BIC)                              3572.120
##   Sample-size adjusted Bayesian (BIC)         3531.063
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000
##   90 Percent confidence interval - lower         0.000
##   90 Percent confidence interval - upper         0.047
##   P-value RMSEA <= 0.05                          0.954
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.038
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor =~                                            
##     TimeOnTummy       1.000                           
##     PreciseLegMovs    0.910    0.240    3.791    0.000
##     PreciseHandMvs    1.099    0.293    3.746    0.000
##   verbal =~                                           
##     Babbling          1.000                           
##     Screeching        0.494    0.182    2.718    0.007
##     VocalImitation    0.716    0.246    2.906    0.004
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor ~~                                            
##     verbal            3.433    1.901    1.806    0.071
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .TimeOnTummy      15.031    3.181    4.725    0.000
##    .PreciseLegMovs   14.709    2.867    5.130    0.000
##    .PreciseHandMvs   12.515    3.338    3.749    0.000
##    .Babbling          8.805    5.149    1.710    0.087
##    .Screeching       17.139    2.748    6.236    0.000
##    .VocalImitation   16.742    3.514    4.764    0.000
##     motor             9.523    3.625    2.627    0.009
##     verbal           15.635    5.947    2.629    0.009

Structural equation model: CFA + Path

model4 <- "
#CFA model
motor =~ TimeOnTummy + PreciseLegMoves + PreciseHandMoves
verbal =~ Babbling + Screeching + VocalImitation

#Path model
motor ~ Age + Weight
verbal ~ Age + Weight
"

fit4 <- sem(model4, data = Babies)
summary(fit4, standardized = TRUE)
## lavaan 0.6-9 ended normally after 81 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        17
##                                                       
##   Number of observations                           100
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                13.018
##   Degrees of freedom                                16
##   P-value (Chi-square)                           0.671
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate    Std.Err  z-value  P(>|z|)   Std.lv    Std.all
##   motor =~                                                                  
##     TimeOnTummy         1.000                                 3.064    0.618
##     PreciseLegMovs      0.919    0.242    3.803    0.000      2.816    0.592
##     PreciseHandMvs      1.111    0.295    3.765    0.000      3.403    0.694
##   verbal =~                                                                 
##     Babbling            1.000                                 3.498    0.708
##     Screeching          0.583    0.189    3.089    0.002      2.040    0.446
##     VocalImitation      0.899    0.263    3.422    0.001      3.144    0.632
## 
## Regressions:
##                    Estimate    Std.Err  z-value  P(>|z|)   Std.lv    Std.all
##   motor ~                                                                   
##     Age                -0.016    0.045   -0.355    0.723     -0.005   -0.043
##     Weight              0.000    0.001    0.085    0.932      0.000    0.010
##   verbal ~                                                                  
##     Age                -0.041    0.051   -0.803    0.422     -0.012   -0.097
##     Weight              0.002    0.001    2.108    0.035      0.001    0.263
## 
## Covariances:
##                    Estimate    Std.Err  z-value  P(>|z|)   Std.lv    Std.all
##  .motor ~~                                                                  
##    .verbal              3.292    1.738    1.894    0.058      0.320    0.320
## 
## Variances:
##                    Estimate    Std.Err  z-value  P(>|z|)   Std.lv    Std.all
##    .TimeOnTummy        15.164    3.159    4.800    0.000     15.164    0.618
##    .PreciseLegMovs     14.662    2.862    5.122    0.000     14.662    0.649
##    .PreciseHandMvs     12.440    3.327    3.740    0.000     12.440    0.518
##    .Babbling           12.204    3.740    3.263    0.001     12.204    0.499
##    .Screeching         16.784    2.717    6.177    0.000     16.784    0.801
##    .VocalImitation     14.880    3.440    4.326    0.000     14.880    0.601
##    .motor               9.372    3.577    2.620    0.009      0.998    0.998
##    .verbal             11.299    4.205    2.687    0.007      0.923    0.923

Measurement invariance

Configural invariance

modelMI <- "
motor =~ TimeOnTummy + PreciseLegMoves + PreciseHandMoves
verbal =~ Babbling + Screeching + VocalImitation
"

fitMIC <- cfa(modelMI, data = Babies, group = "Gender")  # for configural invariance we can specify group gender and leave all other parameters to be unrestricted
summary(fitMIC)
## lavaan 0.6-9 ended normally after 141 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        38
##                                                       
##   Number of observations per group:                   
##     Girls                                           48
##     Boys                                            52
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                15.880
##   Degrees of freedom                                16
##   P-value (Chi-square)                           0.461
##   Test statistic for each group:
##     Girls                                        6.246
##     Boys                                         9.634
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## 
## Group 1 [Girls]:
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor =~                                            
##     TimeOnTummy       1.000                           
##     PreciseLegMovs    0.754    0.282    2.677    0.007
##     PreciseHandMvs    0.982    0.345    2.849    0.004
##   verbal =~                                           
##     Babbling          1.000                           
##     Screeching        0.428    0.237    1.809    0.070
##     VocalImitation    0.632    0.313    2.021    0.043
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor ~~                                            
##     verbal            7.298    3.772    1.935    0.053
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .TimeOnTummy      31.723    0.769   41.272    0.000
##    .PreciseLegMovs   30.863    0.734   42.020    0.000
##    .PreciseHandMvs   30.204    0.725   41.681    0.000
##    .Babbling         29.693    0.781   37.997    0.000
##    .Screeching       29.281    0.669   43.755    0.000
##    .VocalImitation   30.893    0.750   41.215    0.000
##     motor             0.000                           
##     verbal            0.000                           
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .TimeOnTummy      15.328    5.201    2.947    0.003
##    .PreciseLegMovs   18.476    4.570    4.042    0.000
##    .PreciseHandMvs   12.648    4.739    2.669    0.008
##    .Babbling         11.319    8.343    1.357    0.175
##    .Screeching       18.192    4.110    4.426    0.000
##    .VocalImitation   19.781    5.256    3.764    0.000
##     motor            13.031    6.402    2.036    0.042
##     verbal           17.992    9.733    1.849    0.065
## 
## 
## Group 2 [Boys]:
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor =~                                            
##     TimeOnTummy       1.000                           
##     PreciseLegMovs    1.168    0.448    2.607    0.009
##     PreciseHandMvs    1.106    0.409    2.706    0.007
##   verbal =~                                           
##     Babbling          1.000                           
##     Screeching        0.410    0.238    1.727    0.084
##     VocalImitation    0.562    0.302    1.861    0.063
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor ~~                                            
##     verbal           -0.787    2.007   -0.392    0.695
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .TimeOnTummy      31.147    0.634   49.148    0.000
##    .PreciseLegMovs   30.357    0.611   49.675    0.000
##    .PreciseHandMvs   29.426    0.660   44.593    0.000
##    .Babbling         30.179    0.617   48.873    0.000
##    .Screeching       30.251    0.620   48.790    0.000
##    .VocalImitation   29.962    0.655   45.744    0.000
##     motor             0.000                           
##     verbal            0.000                           
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .TimeOnTummy      13.783    3.710    3.715    0.000
##    .PreciseLegMovs    9.737    3.951    2.465    0.014
##    .PreciseHandMvs   13.963    4.139    3.373    0.001
##    .Babbling         -0.009    9.707   -0.001    0.999
##    .Screeching       16.649    3.652    4.559    0.000
##    .VocalImitation   16.035    4.395    3.649    0.000
##     motor             7.101    3.990    1.779    0.075
##     verbal           19.837   10.457    1.897    0.058

Metric invariance

In the case of metric invariance we would like to compare loadings on the factor structures between the two groups. If there are no differences between the models (model without restricted loadings fitting worse), then we have same factor loadings between the groups.

modelMI <- "
motor =~ TimeOnTummy + PreciseLegMoves + PreciseHandMoves
verbal =~ Babbling + Screeching + VocalImitation
"

fitMIM <- cfa(modelMI, data = Babies, group = "Gender", group.equal = "loadings")  #we restrict the loadings on factors between two groups. 
summary(fitMIM)
## lavaan 0.6-9 ended normally after 135 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        38
##   Number of equality constraints                     4
##                                                       
##   Number of observations per group:                   
##     Girls                                           48
##     Boys                                            52
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                16.556
##   Degrees of freedom                                20
##   P-value (Chi-square)                           0.682
##   Test statistic for each group:
##     Girls                                        6.606
##     Boys                                         9.951
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## 
## Group 1 [Girls]:
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor =~                                            
##     TmOnTmm           1.000                           
##     PrcsLgM (.p2.)    0.949    0.247    3.840    0.000
##     PrcsHnM (.p3.)    1.037    0.267    3.880    0.000
##   verbal =~                                           
##     Babblng           1.000                           
##     Scrchng (.p5.)    0.430    0.164    2.619    0.009
##     VclImtt (.p6.)    0.601    0.211    2.854    0.004
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor ~~                                            
##     verbal            7.124    3.399    2.096    0.036
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .TimeOnTummy      31.723    0.757   41.929    0.000
##    .PreciseLegMovs   30.863    0.751   41.121    0.000
##    .PreciseHandMvs   30.204    0.722   41.828    0.000
##    .Babbling         29.693    0.782   37.967    0.000
##    .Screeching       29.281    0.671   43.661    0.000
##    .VocalImitation   30.893    0.747   41.353    0.000
##     motor             0.000                           
##     verbal            0.000                           
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .TimeOnTummy      16.616    4.612    3.603    0.000
##    .PreciseLegMovs   17.266    4.551    3.794    0.000
##    .PreciseHandMvs   13.341    4.238    3.148    0.002
##    .Babbling         10.817    7.130    1.517    0.129
##    .Screeching       18.163    4.020    4.518    0.000
##    .VocalImitation   20.094    4.867    4.128    0.000
##     motor            10.861    4.723    2.300    0.021
##     verbal           18.541    8.400    2.207    0.027
## 
## 
## Group 2 [Boys]:
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor =~                                            
##     TmOnTmm           1.000                           
##     PrcsLgM (.p2.)    0.949    0.247    3.840    0.000
##     PrcsHnM (.p3.)    1.037    0.267    3.880    0.000
##   verbal =~                                           
##     Babblng           1.000                           
##     Scrchng (.p5.)    0.430    0.164    2.619    0.009
##     VclImtt (.p6.)    0.601    0.211    2.854    0.004
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor ~~                                            
##     verbal           -0.695    2.209   -0.315    0.753
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .TimeOnTummy      31.147    0.643   48.419    0.000
##    .PreciseLegMovs   30.357    0.600   50.563    0.000
##    .PreciseHandMvs   29.426    0.662   44.430    0.000
##    .Babbling         30.179    0.617   48.879    0.000
##    .Screeching       30.251    0.619   48.885    0.000
##    .VocalImitation   29.962    0.657   45.613    0.000
##     motor             0.000                           
##     verbal            0.000                           
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .TimeOnTummy      12.951    3.622    3.576    0.000
##    .PreciseLegMovs   11.037    3.171    3.481    0.000
##    .PreciseHandMvs   13.593    3.843    3.537    0.000
##    .Babbling          1.076    6.336    0.170    0.865
##    .Screeching       16.449    3.442    4.779    0.000
##    .VocalImitation   15.667    3.841    4.079    0.000
##     motor             8.566    3.646    2.349    0.019
##     verbal           18.747    7.377    2.541    0.011
# install.packages('semTools')
require(semTools)
summary(compareFit(fitMIC, fitMIM))  # compareFit compares our two models and gives as an information of whether one is worse than the other. If model with restricted loadings is worse that is an indication that loadings are different. 
## ################### Nested Model Comparison #########################
## Chi-Squared Difference Test
## 
##        Df    AIC    BIC  Chisq Chisq diff Df diff Pr(>Chisq)
## fitMIC 16 3573.3 3672.3 15.880                              
## fitMIM 20 3566.0 3654.6 16.556    0.67673       4     0.9542
## 
## ####################### Model Fit Indices ###########################
##          chisq df pvalue rmsea    cfi    tli  srmr       aic       bic
## fitMIC 15.880† 16   .461 .000† 1.000† 1.003  .065  3573.335  3672.332 
## fitMIM 16.556  20   .682 .000† 1.000† 1.067† .064† 3566.012† 3654.588†
## 
## ################## Differences in Fit Indices #######################
##                 df rmsea cfi   tli   srmr    aic     bic
## fitMIM - fitMIC  4     0   0 0.064 -0.001 -7.323 -17.744

Scalar invariance

In the case of scalar invariance we would like to restrict both loadings and intercepts between two groups. Similar to the previous two invariances, when we compare the models if the scalar invariance model fits worse than metric invariance or configural invariance, then we can assume that intercepts or means are not identical between the groups.

modelMI <- "
motor =~ TimeOnTummy + PreciseLegMoves + PreciseHandMoves
verbal =~ Babbling + Screeching + VocalImitation
"

fitMISc <- cfa(modelMI, data = Babies, group = "Gender", group.equal = c("loadings",
    "intercepts"))  #Restriction of the loadings and intercepts 
## Warning in lav_object_post_check(object): lavaan WARNING: some estimated ov
## variances are negative
summary(fitMISc)
## lavaan 0.6-9 ended normally after 153 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        40
##   Number of equality constraints                    10
##                                                       
##   Number of observations per group:                   
##     Girls                                           48
##     Boys                                            52
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                19.394
##   Degrees of freedom                                24
##   P-value (Chi-square)                           0.731
##   Test statistic for each group:
##     Girls                                        8.252
##     Boys                                        11.142
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## 
## Group 1 [Girls]:
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor =~                                            
##     TmOnTmm           1.000                           
##     PrcsLgM (.p2.)    0.949    0.244    3.880    0.000
##     PrcsHnM (.p3.)    1.045    0.266    3.925    0.000
##   verbal =~                                           
##     Babblng           1.000                           
##     Scrchng (.p5.)    0.410    0.163    2.510    0.012
##     VclImtt (.p6.)    0.560    0.206    2.715    0.007
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor ~~                                            
##     verbal            7.181    3.418    2.101    0.036
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .TmOnTmm (.16.)   31.753    0.661   48.007    0.000
##    .PrcsLgM (.17.)   30.920    0.638   48.461    0.000
##    .PrcsHnM (.18.)   30.141    0.661   45.611    0.000
##    .Babblng (.19.)   29.775    0.767   38.801    0.000
##    .Scrchng (.20.)   29.718    0.504   58.930    0.000
##    .VclImtt (.21.)   30.219    0.575   52.515    0.000
##     motor             0.000                           
##     verbal            0.000                           
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .TimeOnTummy      16.671    4.605    3.620    0.000
##    .PreciseLegMovs   17.293    4.543    3.807    0.000
##    .PreciseHandMvs   13.265    4.238    3.130    0.002
##    .Babbling          9.888    7.683    1.287    0.198
##    .Screeching       18.494    4.071    4.543    0.000
##    .VocalImitation   20.960    4.956    4.229    0.000
##     motor            10.803    4.679    2.309    0.021
##     verbal           19.542    8.992    2.173    0.030
## 
## 
## Group 2 [Boys]:
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor =~                                            
##     TmOnTmm           1.000                           
##     PrcsLgM (.p2.)    0.949    0.244    3.880    0.000
##     PrcsHnM (.p3.)    1.045    0.266    3.925    0.000
##   verbal =~                                           
##     Babblng           1.000                           
##     Scrchng (.p5.)    0.410    0.163    2.510    0.012
##     VclImtt (.p6.)    0.560    0.206    2.715    0.007
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor ~~                                            
##     verbal           -0.874    2.208   -0.396    0.692
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .TmOnTmm (.16.)   31.753    0.661   48.007    0.000
##    .PrcsLgM (.17.)   30.920    0.638   48.461    0.000
##    .PrcsHnM (.18.)   30.141    0.661   45.611    0.000
##    .Babblng (.19.)   29.775    0.767   38.801    0.000
##    .Scrchng (.20.)   29.718    0.504   58.930    0.000
##    .VclImtt (.21.)   30.219    0.575   52.515    0.000
##     motor            -0.628    0.766   -0.819    0.413
##     verbal            0.405    0.985    0.411    0.681
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .TimeOnTummy      12.970    3.609    3.594    0.000
##    .PreciseLegMovs   11.054    3.159    3.500    0.000
##    .PreciseHandMvs   13.564    3.847    3.526    0.000
##    .Babbling         -0.129    7.015   -0.018    0.985
##    .Screeching       16.806    3.500    4.802    0.000
##    .VocalImitation   16.307    3.880    4.203    0.000
##     motor             8.523    3.612    2.359    0.018
##     verbal           19.957    8.027    2.486    0.013
summary(compareFit(fitMIM, fitMISc))  # comparison between measurement invariance and scalar invariance model
## ################### Nested Model Comparison #########################
## Chi-Squared Difference Test
## 
##         Df    AIC    BIC  Chisq Chisq diff Df diff Pr(>Chisq)
## fitMIM  20 3566.0 3654.6 16.556                              
## fitMISc 24 3560.8 3639.0 19.394     2.8375       4     0.5854
## 
## ####################### Model Fit Indices ###########################
##           chisq df pvalue rmsea    cfi    tli  srmr       aic       bic
## fitMIM  16.556† 20   .682 .000† 1.000† 1.067  .064† 3566.012  3654.588 
## fitMISc 19.394  24   .731 .000† 1.000† 1.074† .071  3560.850† 3639.005†
## 
## ################## Differences in Fit Indices #######################
##                  df rmsea cfi   tli  srmr    aic     bic
## fitMISc - fitMIM  4     0   0 0.008 0.007 -5.162 -15.583

Strict invariance

Strict invariance restricts loadings, intercepts and error variances. In this case, not only that we are assuming that all direct effects are identical (loadings and intercepts), but we also test whether unexplained variance is identical. This is akin to saying that the same data generating proces and structural effects can be assumed between the two groups.

modelMI <- "
motor =~ TimeOnTummy + PreciseLegMoves + PreciseHandMoves
verbal =~ Babbling + Screeching + VocalImitation
"

fitMISt <- cfa(modelMI, data = Babies, group = "Gender", group.equal = c("loadings",
    "intercepts", "residuals"))  #restricting loadings, intercepts and residuals
summary(fitMISt)
## lavaan 0.6-9 ended normally after 107 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        40
##   Number of equality constraints                    16
##                                                       
##   Number of observations per group:                   
##     Girls                                           48
##     Boys                                            52
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                25.722
##   Degrees of freedom                                30
##   P-value (Chi-square)                           0.689
##   Test statistic for each group:
##     Girls                                       10.852
##     Boys                                        14.870
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## 
## Group 1 [Girls]:
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor =~                                            
##     TmOnTmm           1.000                           
##     PrcsLgM (.p2.)    0.916    0.237    3.858    0.000
##     PrcsHnM (.p3.)    1.046    0.270    3.872    0.000
##   verbal =~                                           
##     Babblng           1.000                           
##     Scrchng (.p5.)    0.357    0.152    2.352    0.019
##     VclImtt (.p6.)    0.497    0.194    2.561    0.010
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor ~~                                            
##     verbal            7.237    3.475    2.083    0.037
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .TmOnTmm (.16.)   31.754    0.663   47.879    0.000
##    .PrcsLgM (.17.)   30.903    0.624   49.533    0.000
##    .PrcsHnM (.18.)   30.145    0.673   44.815    0.000
##    .Babblng (.19.)   29.707    0.776   38.266    0.000
##    .Scrchng (.20.)   29.700    0.506   58.645    0.000
##    .VclImtt (.21.)   30.290    0.582   52.081    0.000
##     motor             0.000                           
##     verbal            0.000                           
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .TmOnTmm (.p7.)   14.722    3.138    4.691    0.000
##    .PrcsLgM (.p8.)   14.342    2.843    5.045    0.000
##    .PrcsHnM (.p9.)   13.256    3.161    4.194    0.000
##    .Babblng (.10.)    1.930    7.776    0.248    0.804
##    .Scrchng (.11.)   18.072    2.752    6.568    0.000
##    .VclImtt (.12.)   19.193    3.335    5.755    0.000
##     motor            11.438    4.800    2.383    0.017
##     verbal           27.035    9.847    2.745    0.006
## 
## 
## Group 2 [Boys]:
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor =~                                            
##     TmOnTmm           1.000                           
##     PrcsLgM (.p2.)    0.916    0.237    3.858    0.000
##     PrcsHnM (.p3.)    1.046    0.270    3.872    0.000
##   verbal =~                                           
##     Babblng           1.000                           
##     Scrchng (.p5.)    0.357    0.152    2.352    0.019
##     VclImtt (.p6.)    0.497    0.194    2.561    0.010
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   motor ~~                                            
##     verbal           -0.554    2.239   -0.248    0.804
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .TmOnTmm (.16.)   31.754    0.663   47.879    0.000
##    .PrcsLgM (.17.)   30.903    0.624   49.533    0.000
##    .PrcsHnM (.18.)   30.145    0.673   44.815    0.000
##    .Babblng (.19.)   29.707    0.776   38.266    0.000
##    .Scrchng (.20.)   29.700    0.506   58.645    0.000
##    .VclImtt (.21.)   30.290    0.582   52.081    0.000
##     motor            -0.635    0.772   -0.823    0.411
##     verbal            0.459    0.994    0.462    0.644
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .TmOnTmm (.p7.)   14.722    3.138    4.691    0.000
##    .PrcsLgM (.p8.)   14.342    2.843    5.045    0.000
##    .PrcsHnM (.p9.)   13.256    3.161    4.194    0.000
##    .Babblng (.10.)    1.930    7.776    0.248    0.804
##    .Scrchng (.11.)   18.072    2.752    6.568    0.000
##    .VclImtt (.12.)   19.193    3.335    5.755    0.000
##     motor             8.157    3.558    2.292    0.022
##     verbal           18.233    8.616    2.116    0.034
summary(compareFit(fitMISc, fitMISt))  # model comparison
## ################### Nested Model Comparison #########################
## Chi-Squared Difference Test
## 
##         Df    AIC    BIC  Chisq Chisq diff Df diff Pr(>Chisq)
## fitMISc 24 3560.8 3639.0 19.394                              
## fitMISt 30 3555.2 3617.7 25.722      6.328       6     0.3875
## 
## ####################### Model Fit Indices ###########################
##           chisq df pvalue rmsea    cfi    tli  srmr       aic       bic
## fitMISc 19.394† 24   .731 .000† 1.000† 1.074† .071† 3560.850  3639.005 
## fitMISt 25.722  30   .689 .000† 1.000† 1.055  .079  3555.178† 3617.702†
## 
## ################## Differences in Fit Indices #######################
##                   df rmsea cfi    tli  srmr    aic     bic
## fitMISt - fitMISc  6     0   0 -0.019 0.008 -5.672 -21.303

Finding differences between the models

If you have differences in the fit of the model, for example, between Measurement invariance and configural invariance model, you would like to see which parameters are different between the groups. You can do that using this function:

lavTestScore(fitMISc)
## $test
## 
## total score test:
## 
##    test    X2 df p.value
## 1 score 3.387 10   0.971
## 
## $uni
## 
## univariate score tests:
## 
##      lhs op   rhs    X2 df p.value
## 1   .p2. == .p25. 0.479  1   0.489
## 2   .p3. == .p26. 0.015  1   0.903
## 3   .p5. == .p28. 0.006  1   0.939
## 4   .p6. == .p29. 0.047  1   0.828
## 5  .p16. == .p39. 0.007  1   0.936
## 6  .p17. == .p40. 0.021  1   0.885
## 7  .p18. == .p41. 0.045  1   0.831
## 8  .p19. == .p42. 0.278  1   0.598
## 9  .p20. == .p43. 0.958  1   0.328
## 10 .p21. == .p44. 1.948  1   0.163

Practical part:

# install.packages('sem')
require(sem)
data("HS.data")  # reading the HS data

Descriptives:

dim(HS.data)  #dimensions of the dataset
## [1] 301  32
summary(HS.data[, c("visual", "cubes", "flags", "paragrap", "sentence", "wordm",
    "addition", "counting", "straight")])  # descriptive statistics for specific variables in our dataset
##      visual          cubes           flags       paragrap         sentence    
##  Min.   : 4.00   Min.   : 9.00   Min.   : 2   Min.   : 0.000   Min.   : 4.00  
##  1st Qu.:25.00   1st Qu.:21.00   1st Qu.:11   1st Qu.: 7.000   1st Qu.:14.00  
##  Median :30.00   Median :24.00   Median :17   Median : 9.000   Median :18.00  
##  Mean   :29.61   Mean   :24.35   Mean   :18   Mean   : 9.183   Mean   :17.36  
##  3rd Qu.:34.00   3rd Qu.:27.00   3rd Qu.:25   3rd Qu.:11.000   3rd Qu.:21.00  
##  Max.   :51.00   Max.   :37.00   Max.   :36   Max.   :19.000   Max.   :28.00  
##      wordm         addition         counting        straight    
##  Min.   : 1.0   Min.   : 30.00   Min.   : 61.0   Min.   :100.0  
##  1st Qu.:10.0   1st Qu.: 80.00   1st Qu.: 97.0   1st Qu.:171.0  
##  Median :14.0   Median : 94.00   Median :110.0   Median :195.0  
##  Mean   :15.3   Mean   : 96.24   Mean   :110.5   Mean   :193.4  
##  3rd Qu.:19.0   3rd Qu.:113.00   3rd Qu.:122.0   3rd Qu.:219.0  
##  Max.   :43.0   Max.   :171.00   Max.   :200.0   Max.   :333.0

Plots:

# install.packages('psych')
require(psych)
scatter.hist(x = HS.data$visual, y = HS.data$cubes, density = T, ellipse = T)  # bi-variate scatterplot for our data

Specification of the model:

detach("package:sem")
fact3 <- "
spatial =~ visual + cubes + flags
verbal =~ paragrap + sentence + wordm
speed =~ addition + counting + straight
"

fact3fit <- cfa(fact3, data = HS.data)
summary(fact3fit, fit.measures = TRUE, standardized = TRUE)
## lavaan 0.6-9 ended normally after 150 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        21
##                                                       
##   Number of observations                           301
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                85.172
##   Degrees of freedom                                24
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                               918.592
##   Degrees of freedom                                36
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.931
##   Tucker-Lewis Index (TLI)                       0.896
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -9578.017
##   Loglikelihood unrestricted model (H1)      -9535.431
##                                                       
##   Akaike (AIC)                               19198.034
##   Bayesian (BIC)                             19275.883
##   Sample-size adjusted Bayesian (BIC)        19209.283
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.092
##   90 Percent confidence interval - lower         0.071
##   90 Percent confidence interval - upper         0.114
##   P-value RMSEA <= 0.05                          0.001
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.065
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   spatial =~                                                            
##     visual            1.000                               5.397    0.772
##     cubes             0.369    0.066    5.553    0.000    1.992    0.424
##     flags             0.973    0.146    6.682    0.000    5.250    0.581
##   verbal =~                                                             
##     paragrap          1.000                               2.969    0.852
##     sentence          1.484    0.087   17.015    0.000    4.406    0.855
##     wordm             2.161    0.129   16.703    0.000    6.416    0.838
##   speed =~                                                              
##     addition          1.000                              14.235    0.570
##     counting          1.026    0.144    7.152    0.000   14.611    0.723
##     straight          1.696    0.237    7.154    0.000   24.143    0.665
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   spatial ~~                                                            
##     verbal            7.347    1.323    5.552    0.000    0.459    0.459
##     speed            36.098    7.755    4.655    0.000    0.470    0.470
##   verbal ~~                                                             
##     speed            11.991    3.401    3.525    0.000    0.284    0.284
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .visual           19.774    4.091    4.833    0.000   19.774    0.404
##    .cubes            18.139    1.628   11.145    0.000   18.139    0.820
##    .flags            54.033    5.801    9.314    0.000   54.033    0.662
##    .paragrap          3.340    0.429    7.778    0.000    3.340    0.275
##    .sentence          7.140    0.934    7.642    0.000    7.140    0.269
##    .wordm            17.455    2.109    8.278    0.000   17.455    0.298
##    .addition        421.390   42.927    9.816    0.000  421.390    0.675
##    .counting        195.314   29.676    6.582    0.000  195.314    0.478
##    .straight        735.487   91.898    8.003    0.000  735.487    0.558
##     spatial          29.127    5.238    5.561    0.000    1.000    1.000
##     verbal            8.816    1.009    8.737    0.000    1.000    1.000
##     speed           202.639   45.503    4.453    0.000    1.000    1.000

Explained variance - R2

inspect(fact3fit, "r2")  #get r2 for our measured variables
##   visual    cubes    flags paragrap sentence    wordm addition counting 
##    0.596    0.180    0.338    0.725    0.731    0.702    0.325    0.522 
## straight 
##    0.442

Multivariate normality

# install.packages('MVN')
require(MVN)
test <- mvn(HS.data[, c("visual", "cubes", "flags", "paragrap", "sentence", "wordm",
    "addition", "counting", "straight")], mvnTest = "royston")  # multivariate normality for the variables used in our model
test$multivariateNormality
##      Test        H      p value MVN
## 1 Royston 125.7982 6.647398e-23  NO

If we do not have multivariate normality, we can calculate robust errors and different test statistic

fact3fitRob <- cfa(fact3, data = HS.data, se = "robust.sem", test = "satorra.bentler")
summary(fact3fitRob, standardized = TRUE)
## lavaan 0.6-9 ended normally after 150 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        21
##                                                       
##   Number of observations                           301
##                                                       
## Model Test User Model:
##                                               Standard      Robust
##   Test Statistic                                85.172      80.756
##   Degrees of freedom                                24          24
##   P-value (Chi-square)                           0.000       0.000
##   Scaling correction factor                                  1.055
##        Satorra-Bentler correction                                 
## 
## Parameter Estimates:
## 
##   Standard errors                           Robust.sem
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   spatial =~                                                            
##     visual            1.000                               5.397    0.772
##     cubes             0.369    0.069    5.358    0.000    1.992    0.424
##     flags             0.973    0.153    6.364    0.000    5.250    0.581
##   verbal =~                                                             
##     paragrap          1.000                               2.969    0.852
##     sentence          1.484    0.089   16.763    0.000    4.406    0.855
##     wordm             2.161    0.139   15.498    0.000    6.416    0.838
##   speed =~                                                              
##     addition          1.000                              14.235    0.570
##     counting          1.026    0.132    7.755    0.000   14.611    0.723
##     straight          1.696    0.208    8.170    0.000   24.143    0.665
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   spatial ~~                                                            
##     verbal            7.347    1.480    4.965    0.000    0.459    0.459
##     speed            36.098    7.596    4.752    0.000    0.470    0.470
##   verbal ~~                                                             
##     speed            11.991    3.811    3.146    0.002    0.284    0.284
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .visual           19.774    4.982    3.969    0.000   19.774    0.404
##    .cubes            18.139    1.719   10.552    0.000   18.139    0.820
##    .flags            54.033    5.413    9.981    0.000   54.033    0.662
##    .paragrap          3.340    0.450    7.423    0.000    3.340    0.275
##    .sentence          7.140    0.929    7.689    0.000    7.140    0.269
##    .wordm            17.455    2.267    7.701    0.000   17.455    0.298
##    .addition        421.390   41.596   10.130    0.000  421.390    0.675
##    .counting        195.314   29.691    6.578    0.000  195.314    0.478
##    .straight        735.487   88.246    8.335    0.000  735.487    0.558
##     spatial          29.127    6.024    4.836    0.000    1.000    1.000
##     verbal            8.816    1.087    8.109    0.000    1.000    1.000
##     speed           202.639   43.714    4.636    0.000    1.000    1.000

Modification indices

Finally, we can check modification indices and include some of them into our model.

mi <- modindices(fact3fitRob)
mi
##         lhs op      rhs     mi      epc  sepc.lv sepc.all sepc.nox
## 25  spatial =~ paragrap  1.216    0.038    0.207    0.059    0.059
## 26  spatial =~ sentence  7.446   -0.140   -0.756   -0.147   -0.147
## 27  spatial =~    wordm  2.839    0.130    0.701    0.092    0.092
## 28  spatial =~ addition 18.568   -1.611   -8.697   -0.348   -0.348
## 29  spatial =~ counting  4.192   -0.692   -3.736   -0.185   -0.185
## 30  spatial =~ straight 36.044    3.446   18.600    0.512    0.512
## 31   verbal =~   visual  8.923    0.702    2.083    0.298    0.298
## 32   verbal =~    cubes  0.018   -0.015   -0.045   -0.010   -0.010
## 33   verbal =~    flags  9.162   -0.725   -2.153   -0.238   -0.238
## 34   verbal =~ addition  0.074   -0.139   -0.414   -0.017   -0.017
## 35   verbal =~ counting  3.403   -0.811   -2.407   -0.119   -0.119
## 36   verbal =~ straight  4.693    1.646    4.886    0.135    0.135
## 37    speed =~   visual  0.014    0.006    0.091    0.013    0.013
## 38    speed =~    cubes  1.572   -0.034   -0.490   -0.104   -0.104
## 39    speed =~    flags  0.711    0.047    0.671    0.074    0.074
## 40    speed =~ paragrap  0.002   -0.001   -0.008   -0.002   -0.002
## 41    speed =~ sentence  0.199   -0.008   -0.109   -0.021   -0.021
## 42    speed =~    wordm  0.260    0.013    0.187    0.024    0.024
## 43   visual ~~    cubes  3.619   -4.421   -4.421   -0.233   -0.233
## 44   visual ~~    flags  0.929   -6.633   -6.633   -0.203   -0.203
## 45   visual ~~ paragrap  3.547    1.408    1.408    0.173    0.173
## 46   visual ~~ sentence  0.521   -0.795   -0.795   -0.067   -0.067
## 47   visual ~~    wordm  0.049    0.370    0.370    0.020    0.020
## 48   visual ~~ addition  5.461  -17.860  -17.860   -0.196   -0.196
## 49   visual ~~ counting  0.602   -4.804   -4.804   -0.077   -0.077
## 50   visual ~~ straight  7.252   29.650   29.650    0.246    0.246
## 51    cubes ~~    flags  8.529    6.986    6.986    0.223    0.223
## 52    cubes ~~ paragrap  0.535   -0.406   -0.406   -0.052   -0.052
## 53    cubes ~~ sentence  0.022   -0.122   -0.122   -0.011   -0.011
## 54    cubes ~~    wordm  0.786    1.099    1.099    0.062    0.062
## 55    cubes ~~ addition  8.918  -16.784  -16.784   -0.192   -0.192
## 56    cubes ~~ counting  0.053   -0.993   -0.993   -0.017   -0.017
## 57    cubes ~~ straight  1.907   10.864   10.864    0.094    0.094
## 58    flags ~~ paragrap  0.143   -0.381   -0.381   -0.028   -0.028
## 59    flags ~~ sentence  7.860   -4.163   -4.163   -0.212   -0.212
## 60    flags ~~    wordm  1.856    3.068    3.068    0.100    0.100
## 61    flags ~~ addition  0.641   -8.191   -8.191   -0.054   -0.054
## 62    flags ~~ counting  0.054   -1.857   -1.857   -0.018   -0.018
## 63    flags ~~ straight  4.097   29.208   29.208    0.147    0.147
## 64 paragrap ~~ sentence  2.519    2.222    2.222    0.455    0.455
## 65 paragrap ~~    wordm  6.207   -4.923   -4.923   -0.645   -0.645
## 66 paragrap ~~ addition  6.015    6.818    6.818    0.182    0.182
## 67 paragrap ~~ counting  3.856   -4.173   -4.173   -0.163   -0.163
## 68 paragrap ~~ straight  0.187   -1.681   -1.681   -0.034   -0.034
## 69 sentence ~~    wordm  0.920    2.829    2.829    0.253    0.253
## 70 sentence ~~ addition  1.188   -4.459   -4.459   -0.081   -0.081
## 71 sentence ~~ counting  0.338    1.818    1.818    0.049    0.049
## 72 sentence ~~ straight  0.982    5.664    5.664    0.078    0.078
## 73    wordm ~~ addition  0.270   -3.226   -3.226   -0.038   -0.038
## 74    wordm ~~ counting  0.283    2.526    2.526    0.043    0.043
## 75    wordm ~~ straight  0.105   -2.810   -2.810   -0.025   -0.025
## 76 addition ~~ counting 33.716  245.021  245.021    0.854    0.854
## 77 addition ~~ straight  5.108 -153.592 -153.592   -0.276   -0.276
## 78 counting ~~ straight 14.764 -302.949 -302.949   -0.799   -0.799

Change the model

fact3A <- "
spatial =~ visual + cubes + flags + straight + addition
verbal =~ paragrap + sentence + wordm
speed =~ addition + counting + straight
"

fact3AfitRob <- cfa(fact3A, data = HS.data, se = "robust.sem", test = "satorra.bentler")
summary(fact3AfitRob, fit.measures = TRUE, standardized = TRUE)
## lavaan 0.6-9 ended normally after 200 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        23
##                                                       
##   Number of observations                           301
##                                                       
## Model Test User Model:
##                                               Standard      Robust
##   Test Statistic                                46.251      44.242
##   Degrees of freedom                                22          22
##   P-value (Chi-square)                           0.002       0.003
##   Scaling correction factor                                  1.045
##        Satorra-Bentler correction                                 
## 
## Model Test Baseline Model:
## 
##   Test statistic                               918.592     789.304
##   Degrees of freedom                                36          36
##   P-value                                        0.000       0.000
##   Scaling correction factor                                  1.164
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.973       0.970
##   Tucker-Lewis Index (TLI)                       0.955       0.952
##                                                                   
##   Robust Comparative Fit Index (CFI)                         0.973
##   Robust Tucker-Lewis Index (TLI)                            0.957
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -9558.556   -9558.556
##   Loglikelihood unrestricted model (H1)      -9535.431   -9535.431
##                                                                   
##   Akaike (AIC)                               19163.112   19163.112
##   Bayesian (BIC)                             19248.376   19248.376
##   Sample-size adjusted Bayesian (BIC)        19175.433   19175.433
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.061       0.058
##   90 Percent confidence interval - lower         0.036       0.033
##   90 Percent confidence interval - upper         0.085       0.082
##   P-value RMSEA <= 0.05                          0.220       0.271
##                                                                   
##   Robust RMSEA                                               0.059
##   90 Percent confidence interval - lower                     0.033
##   90 Percent confidence interval - upper                     0.084
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.042       0.042
## 
## Parameter Estimates:
## 
##   Standard errors                           Robust.sem
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   spatial =~                                                            
##     visual            1.000                               5.283    0.755
##     cubes             0.397    0.068    5.864    0.000    2.100    0.447
##     flags             1.014    0.140    7.234    0.000    5.355    0.593
##     straight          2.255    0.573    3.936    0.000   11.911    0.328
##     addition         -1.049    0.532   -1.971    0.049   -5.541   -0.222
##   verbal =~                                                             
##     paragrap          1.000                               2.965    0.850
##     sentence          1.489    0.089   16.725    0.000    4.415    0.857
##     wordm             2.163    0.140   15.491    0.000    6.413    0.838
##   speed =~                                                              
##     addition          1.000                              18.934    0.758
##     counting          0.775    0.141    5.486    0.000   14.678    0.726
##     straight          0.921    0.144    6.404    0.000   17.434    0.480
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   spatial ~~                                                            
##     verbal            6.864    1.441    4.764    0.000    0.438    0.438
##     speed            39.193   14.124    2.775    0.006    0.392    0.392
##   verbal ~~                                                             
##     speed            15.065    5.610    2.686    0.007    0.268    0.268
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .visual           20.990    4.397    4.774    0.000   20.990    0.429
##    .cubes            17.700    1.664   10.635    0.000   17.700    0.801
##    .flags            52.914    4.821   10.976    0.000   52.914    0.649
##    .straight        709.836   81.046    8.758    0.000  709.836    0.538
##    .addition        317.043   59.743    5.307    0.000  317.043    0.508
##    .paragrap          3.366    0.452    7.448    0.000    3.366    0.277
##    .sentence          7.064    0.922    7.663    0.000    7.064    0.266
##    .wordm            17.493    2.271    7.702    0.000   17.493    0.298
##    .counting        193.360   35.564    5.437    0.000  193.360    0.473
##     spatial          27.911    5.407    5.162    0.000    1.000    1.000
##     verbal            8.790    1.087    8.085    0.000    1.000    1.000
##     speed           358.496   94.408    3.797    0.000    1.000    1.000

Comparison of the new and original model

diff <- compareFit(fact3fitRob, fact3AfitRob)
summary(diff)
## ################### Nested Model Comparison #########################
## Scaled Chi-Squared Difference Test (method = "satorra.bentler.2001")
## 
## lavaan NOTE:
##     The "Chisq" column contains standard test statistics, not the
##     robust test that should be reported per model. A robust difference
##     test is a function of two standard (not robust) statistics.
##  
##              Df   AIC   BIC  Chisq Chisq diff Df diff Pr(>Chisq)    
## fact3AfitRob 22 19163 19248 46.251                                  
## fact3fitRob  24 19198 19276 85.172     33.645       2  4.943e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## ####################### Model Fit Indices ###########################
##              chisq.scaled df.scaled pvalue.scaled rmsea.robust cfi.robust
## fact3AfitRob      44.242†        22          .003        .059†      .973†
## fact3fitRob       80.756         24          .000        .091       .932 
##              tli.robust  srmr        aic        bic
## fact3AfitRob      .957† .042† 19163.112† 19248.376†
## fact3fitRob       .898  .065  19198.034  19275.883 
## 
## ################## Differences in Fit Indices #######################
##                            df.scaled rmsea.robust cfi.robust tli.robust  srmr
## fact3fitRob - fact3AfitRob         2        0.032     -0.042     -0.059 0.023
##                               aic    bic
## fact3fitRob - fact3AfitRob 34.922 27.508

Lecture 5: Mixed-effect models

Theoretical part

Simulation of the data with hierarchical categories

set.seed(456)
# Fixed effects
alpha_0 <- 500  #intercept
beta_1 <- 50  #slope 
sigma <- 100  #sd

# by-intercept sd, by_slope sd and correlation between intercept and slope sd
tau_0 <- 30  # by-group random intercet (countries)

tau_1 <- 30  # by-group random slope (countries)
rho <- 0

n_babies <- 10

n_rfx <- faux::rnorm_multi(n = n_babies, mu = 0, sd = c(tau_0, tau_1), r = rho, varnames = c("T_0s",
    "T_1s"))

babies_rfx = data.frame(T_0s = rep(n_rfx$T_0s, each = 200), T_1s = rep(n_rfx$T_1s,
    each = 200))

Babies <- data.frame(Babies_id = rep(1:10, each = 200), babies_rfx)

Babies$Age = round(runif(2000, 1, 30))
Babies$Surounding = rnorm(2000, Babies$T_0s, 20)
Babies$Weight = rnorm(2000, 20, 10)
Babies$CalorieIntake = alpha_0 + Babies$T_0s + (beta_1 + Babies$T_1s) * Babies$Age +
    4 * Babies$Surounding + 5 * Babies$Weight + rnorm(2000, 0, sigma)
table(Babies$Babies_id)  #How many observations we have for each group
## 
##   1   2   3   4   5   6   7   8   9  10 
## 200 200 200 200 200 200 200 200 200 200

Complete pooling model

Linear model where we ignore information about the countries:

mod1CP <- lm(CalorieIntake ~ Age, data = Babies)
summary(mod1CP)
## 
## Call:
## lm(formula = CalorieIntake ~ Age, data = Babies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1544.49  -226.93    79.93   295.13  1035.00 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  470.846     20.356   23.13   <2e-16 ***
## Age           49.603      1.143   43.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 428.1 on 1998 degrees of freedom
## Multiple R-squared:  0.485,  Adjusted R-squared:  0.4848 
## F-statistic:  1882 on 1 and 1998 DF,  p-value: < 2.2e-16

Looking at residuals:

par(mfrow = c(1, 1), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(resid(mod1CP)[1:200], ylab = "Residuals", xlab = "Index")

No pooling model

Linear model where we estimates parameters for each country separately:

mod1NP <- lm(CalorieIntake ~ Age + factor(Babies_id) - 1, data = Babies)
summary(mod1NP)
## 
## Call:
## lm(formula = CalorieIntake ~ Age + factor(Babies_id) - 1, data = Babies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -809.29 -160.35   -6.41  162.10  855.07 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## Age                   50.2790     0.6783  74.124  < 2e-16 ***
## factor(Babies_id)1   102.5872    20.9058   4.907 9.99e-07 ***
## factor(Babies_id)2   684.2652    20.6108  33.199  < 2e-16 ***
## factor(Babies_id)3   862.3240    21.2219  40.634  < 2e-16 ***
## factor(Babies_id)4  -305.3031    20.8831 -14.620  < 2e-16 ***
## factor(Babies_id)5   487.4414    20.9251  23.295  < 2e-16 ***
## factor(Babies_id)6   132.7962    21.0883   6.297 3.72e-10 ***
## factor(Babies_id)7   659.7740    20.8901  31.583  < 2e-16 ***
## factor(Babies_id)8   656.3745    20.7328  31.659  < 2e-16 ***
## factor(Babies_id)9   679.1863    20.4171  33.266  < 2e-16 ***
## factor(Babies_id)10  642.8545    20.7242  31.019  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 253.2 on 1989 degrees of freedom
## Multiple R-squared:  0.9668, Adjusted R-squared:  0.9666 
## F-statistic:  5259 on 11 and 1989 DF,  p-value: < 2.2e-16

Looking at residuals:

par(mfrow = c(1, 1), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(resid(mod1CP)[1:200], ylab = "Residuals", xlab = "Index")

Multilevel model: intercept

# install.packages('lme4')
require(lme4)

mult1 <- lmer(CalorieIntake ~ Age + (1 | Babies_id), data = Babies)
summary(mult1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: CalorieIntake ~ Age + (1 | Babies_id)
##    Data: Babies
## 
## REML criterion at convergence: 27858.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.1994 -0.6327 -0.0245  0.6410  3.3694 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  Babies_id (Intercept) 132275   363.7   
##  Residual               64118   253.2   
## Number of obs: 2000, groups:  Babies_id, 10
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 460.2558   115.6422    3.98
## Age          50.2773     0.6783   74.12
## 
## Correlation of Fixed Effects:
##     (Intr)
## Age -0.092

Multilevel model: intercept and slope adjustments

mult2 <- lmer(CalorieIntake ~ Age + (1 + Age | Babies_id), data = Babies)
summary(mult2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: CalorieIntake ~ Age + (1 + Age | Babies_id)
##    Data: Babies
## 
## REML criterion at convergence: 25474.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.7037 -0.6672  0.0194  0.6576  4.0965 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev. Corr 
##  Babies_id (Intercept) 35937.8  189.57        
##            Age           701.9   26.49   -0.50
##  Residual              18938.3  137.62        
## Number of obs: 2000, groups:  Babies_id, 10
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  469.579     60.307   7.786
## Age           50.048      8.386   5.968
## 
## Correlation of Fixed Effects:
##     (Intr)
## Age -0.497

Comparison of residuals

par(mfrow = c(1, 3), bty = "n", mar = c(5, 4, 1, 0.1), cex = 1.1, pch = 16)
plot(resid(mod1CP)[1:200], ylab = "Residuals", xlab = "Index", main = "Complete pooling")
plot(resid(mod1NP)[1:200], ylab = "Residuals", xlab = "Index", main = "No pooling")
plot(resid(mult2)[1:200], ylab = "Residuals", xlab = "Index", main = "Partial pooling")

Fixed and random effects

fixef(mult2)
## (Intercept)         Age 
##   469.57946    50.04775
ranef(mult2)
## $Babies_id
##    (Intercept)        Age
## 1    273.96266 -40.083989
## 2    -75.56727  19.518621
## 3     15.73010  22.659262
## 4   -130.81904 -40.356098
## 5    362.47248 -21.416831
## 6   -129.73763 -12.337377
## 7   -105.14316  18.855336
## 8     94.77461   6.176257
## 9   -209.91822  29.259636
## 10   -95.75452  17.725183
## 
## with conditional variances for "Babies_id"

Intra-class correlation (ICC)

mod1 <- lmer(CalorieIntake ~ Age + (1 | Babies_id), data = Babies)
performance::icc(mod1)
## # Intraclass Correlation Coefficient
## 
##      Adjusted ICC: 0.674
##   Conditional ICC: 0.354

Significance testing: fixed effects

# install.packages('lmerTest')
require(lmerTest)
mult2 <- lmer(CalorieIntake ~ Age + (1 + Age | Babies_id), data = Babies)
print(summary(mult2), cor = F)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: CalorieIntake ~ Age + (1 + Age | Babies_id)
##    Data: Babies
## 
## REML criterion at convergence: 25474.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.7037 -0.6672  0.0194  0.6576  4.0965 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev. Corr 
##  Babies_id (Intercept) 35937.8  189.57        
##            Age           701.9   26.49   -0.50
##  Residual              18938.3  137.62        
## Number of obs: 2000, groups:  Babies_id, 10
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)  469.579     60.307   8.995   7.786 2.75e-05 ***
## Age           50.048      8.386   9.001   5.968 0.000211 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Practical aspect

# install.packages('foreign')
require(foreign)  #to read-in the data that are SPSS format (.sav) we need foreign package
mwell <- read.spss("data.sav", to.data.frame = T)  #read.spss function and we specify that we should read it as a data frame
dim(mwell)  #dimensions of our data (number of rows and columns)
## [1] 120115     74
mwell$Total = mwell$Watchwe_adj + mwell$Watchwk_adj + mwell$Comphwe_adj + mwell$Comphwk_adj +
    mwell$Smartwe_adj + mwell$Smartwk_adj  #total amount of time spent watching screen in a week

Number of missing values in our outcome:

table(is.na(mwell$mwb))
## 
##  FALSE   TRUE 
## 102580  17535

Numbe of missing values in our predictor:

table(is.na(mwell$Watchwk))
## 
##  FALSE   TRUE 
## 117638   2477

Density plots for our variables

par(mfrow = c(1, 2), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(density(mwell$mwb, na.rm = TRUE), main = "")
plot(density(mwell$Total, na.rm = T), main = "")

Subsetting the data - excluding NAs:

mwell2 = mwell[!is.na(mwell$mwb) & !is.na(mwell$Total), ]  # we are trying to subset our data frame and take only values that are not NAs. is.na is a boolean function that tells us whether one row is NA or not (TRUE or FALSE). !is.na indicates that we do not want TRUE na.values. We also have a logical parameter & that combines two conditions !is.na(outcome) & !is.na(predictor). Finally, we would like to filter our dataset by row and exclude all the rows that have either TRUE value in our predictor or our outcome. Therefore, we are looking at rows mwell[function goes here,] instead of mwell[,function goes here ] which would look at columns 
dim(mwell2)  #dimensions of the smaller data
## [1] 98853    75

Scatter plot:

cor(mwell2$mwb, mwell2$Total)
## [1] -0.1724534
par(mfrow = c(1, 1), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(mwell2$Total[1:500], mwell2$mwb[1:500])

Number of observations for each of our potential random structures:

table(mwell2$Ethnicg)
## 
##            White            Mixed            Asian            Black 
##            76428             3827             9537             4413 
## Other or unknown 
##             4648
table(mwell2$REGION)
## 
## East Midlands                                                            
##                                                                     6440 
## East of England                                                          
##                                                                     7612 
## London                                                                   
##                                                                    18524 
## North East                                                               
##                                                                     6826 
## North West                                                               
##                                                                    15276 
## South East                                                               
##                                                                    13160 
## South West                                                               
##                                                                    10105 
## West Midlands                                                            
##                                                                    10328 
## Yorkshire and the Humber                                                 
##                                                                    10582

Building the model:

MWmod1 <- lmer(mwb ~ (1 | LANAME), data = mwell2)  #random effect of Local area
MWmod2 <- lmer(mwb ~ (1 | LANAME) + (1 | Ethnicg), data = mwell2)  #crossed random effects of local area and ethnicity
MWmod3 <- lmer(mwb ~ (1 | REGION/LANAME) + (1 | Ethnicg), data = mwell2)  #nested random effect of local area that is nested in region and crossed with ethnicity
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00241727 (tol = 0.002, component 1)
anova(MWmod1, MWmod2, MWmod3)  #comparison of the models
## refitting model(s) with ML (instead of REML)
## Data: mwell2
## Models:
## MWmod1: mwb ~ (1 | LANAME)
## MWmod2: mwb ~ (1 | LANAME) + (1 | Ethnicg)
## MWmod3: mwb ~ (1 | REGION/LANAME) + (1 | Ethnicg)
##        npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)   
## MWmod1    3 726331 726359 -363162   726325                        
## MWmod2    4 726326 726364 -363159   726318 7.1022  1   0.007699 **
## MWmod3    5 726323 726371 -363157   726313 4.2095  1   0.040198 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Building the model: Fixed structure 1

MWmod2a <- lmer(mwb ~ Total + (1 | LANAME) + (1 | Ethnicg), data = mwell2)  #main effect of total
print(summary(MWmod2a), cor = F)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: mwb ~ Total + (1 | LANAME) + (1 | Ethnicg)
##    Data: mwell2
## 
## REML criterion at convergence: 723288.3
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.0004 -0.6215  0.0686  0.6822  3.0084 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  LANAME   (Intercept)  0.2160  0.4647  
##  Ethnicg  (Intercept)  0.2936  0.5418  
##  Residual             87.9915  9.3804  
## Number of obs: 98853, groups:  LANAME, 150; Ethnicg, 5
## 
## Fixed effects:
##               Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)  5.109e+01  2.594e-01  4.583e+00  196.96 3.53e-10 ***
## Total       -2.054e-01  3.695e-03  9.801e+04  -55.59  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Building the model: Fixed structure 2

MWmod2b <- lmer(mwb ~ Total + male + (1 | LANAME) + (1 | Ethnicg), data = mwell2)  #main effect of total and main effect of sex
MWmod2c <- lmer(mwb ~ Total * male + (1 | LANAME) + (1 | Ethnicg), data = mwell2)  #interaction between total and sex
anova(MWmod2a, MWmod2b, MWmod2c)  #comparison of the models 
## refitting model(s) with ML (instead of REML)
## Data: mwell2
## Models:
## MWmod2a: mwb ~ Total + (1 | LANAME) + (1 | Ethnicg)
## MWmod2b: mwb ~ Total + male + (1 | LANAME) + (1 | Ethnicg)
## MWmod2c: mwb ~ Total * male + (1 | LANAME) + (1 | Ethnicg)
##         npar    AIC    BIC  logLik deviance   Chisq Df Pr(>Chisq)    
## MWmod2a    5 723288 723335 -361639   723278                          
## MWmod2b    6 717197 717254 -358592   717185 6093.02  1  < 2.2e-16 ***
## MWmod2c    7 716978 717044 -358482   716964  221.25  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
print(summary(MWmod2c), cor = F)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: mwb ~ Total * male + (1 | LANAME) + (1 | Ethnicg)
##    Data: mwell2
## 
## REML criterion at convergence: 716985.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.1959 -0.6181  0.0586  0.6695  3.3167 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  LANAME   (Intercept)  0.1892  0.4349  
##  Ethnicg  (Intercept)  0.3367  0.5802  
##  Residual             82.5528  9.0859  
## Number of obs: 98853, groups:  LANAME, 150; Ethnicg, 5
## 
## Fixed effects:
##                 Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)    4.884e+01  2.835e-01  5.101e+00  172.26 8.42e-11 ***
## Total         -1.979e-01  4.928e-03  9.864e+04  -40.17  < 2e-16 ***
## maleYes        2.921e+00  1.328e-01  9.881e+04   21.99  < 2e-16 ***
## Total:maleYes  1.084e-01  7.285e-03  9.880e+04   14.88  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Random structure: random slopes

MWmod3c <- lmer(mwb ~ Total * male + (1 + Total | LANAME) + (1 | Ethnicg), data = mwell2)  #random slopes for total predictor for Local area and intercept for local area 
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## unable to evaluate scaled gradient
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge: degenerate Hessian with 1 negative eigenvalues
## Warning: Model failed to converge with 1 negative eigenvalue: -8.9e+00
MWmod3c <- lmer(mwb ~ Total * male + (1 | LANAME:male) + (1 | Ethnicg), data = mwell2)  #random intercept for unique combination between local area and sex 
anova(MWmod2c, MWmod3c)  #comparison of the models
## refitting model(s) with ML (instead of REML)
## Data: mwell2
## Models:
## MWmod2c: mwb ~ Total * male + (1 | LANAME) + (1 | Ethnicg)
## MWmod3c: mwb ~ Total * male + (1 | LANAME:male) + (1 | Ethnicg)
##         npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
## MWmod2c    7 716978 717044 -358482   716964                    
## MWmod3c    7 716995 717062 -358491   716981     0  0
MWmod3c <- lmer(mwb ~ Total * male + (1 | LANAME) + (1 | Ethnicg:male), data = mwell2)  #random intercepts for unique combination between Ethnicity and sex
anova(MWmod2c, MWmod3c)
## refitting model(s) with ML (instead of REML)
## Data: mwell2
## Models:
## MWmod2c: mwb ~ Total * male + (1 | LANAME) + (1 | Ethnicg)
## MWmod3c: mwb ~ Total * male + (1 | LANAME) + (1 | Ethnicg:male)
##         npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
## MWmod2c    7 716978 717044 -358482   716964                    
## MWmod3c    7 716955 717022 -358471   716941 22.41  0
summary(MWmod3)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: mwb ~ (1 | REGION/LANAME) + (1 | Ethnicg)
##    Data: mwell2
## 
## REML criterion at convergence: 726315.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.6003 -0.6199  0.0771  0.6881  2.4872 
## 
## Random effects:
##  Groups        Name        Variance Std.Dev.
##  LANAME:REGION (Intercept)  0.18076 0.4252  
##  REGION        (Intercept)  0.03315 0.1821  
##  Ethnicg       (Intercept)  0.08082 0.2843  
##  Residual                  90.74939 9.5262  
## Number of obs: 98853, groups:  LANAME:REGION, 150; REGION, 9; Ethnicg, 5
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)  47.5569     0.1568  5.2742   303.3 2.16e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## optimizer (nloptwrap) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.00241727 (tol = 0.002, component 1)

Visualisation of the random structure:

require(sjPlot)
plot_model(MWmod3c, type = "re", sort.est = "sort.all", grid = FALSE)[1]  #type='re' gives us random adjustments, while [1] indicates that we want only first plot
## [[1]]

plot_model(MWmod3c, type = "re", sort.est = "sort.all", grid = FALSE)[2]
## [[1]]

Visualisation of the fixed effects (interaction):

plot_model(MWmod3c, type = "int")

Significance of the random effects

ranova(MWmod3c)
## ANOVA-like table for random-effects: Single term deletions
## 
## Model:
## mwb ~ Total + male + (1 | LANAME) + (1 | Ethnicg:male) + Total:male
##                    npar  logLik    AIC    LRT Df Pr(>Chisq)    
## <none>                7 -358480 716974                         
## (1 | LANAME)          6 -358523 717059 86.708  1  < 2.2e-16 ***
## (1 | Ethnicg:male)    6 -358525 717061 88.990  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Explained variance - R2

# install.packages('MuMIn')
require(MuMIn)
r.squaredGLMM(MWmod3c)  # m stands for marginal, while c stands for conditional. Marginal is approximation of the explained variance by fixed-effect structure, while conditional is with both fixed and random-effect structure
##             R2m        R2c
## [1,] 0.08319687 0.08936068

Predictions of the model

mwell2$predicted = predict(MWmod3c)  #prediction values from model to our train dataset
par(mfrow = c(1, 1), bty = "n", mar = c(5, 4, 0.1, 0.1), cex = 1.1, pch = 16)
plot(mwell2$predicted, mwell2$mwb)  #plot predicted versus observed data points