Lecture 5: SEM (Path models)

class: center, middle, inverse, title-slide

.title[
# Lecture 5: SEM (Path models)
]
.author[
### Dr Nemanja Vaci
]
.institute[
### University of Sheffield
]
.date[
### 2026-03-18
]

---

## Press record

---

## Intended learning outcomes

Motivate the use of path and CFA models; argue how they connect to the other models covered in the course.

Calculate the number of free parameters and degrees of freedom of the proposed model.

Build a model in the R statistical environment, estimate it, and interpret the coefficients.

Criticise, modify, compare, and evaluate the fit of the proposed models.

---
## Structural equation modelling (SEM)

General framework that uses various models to test relationships among variables

Other terms: covariance structure analysis, covariance structure modelling, __causal modelling__

Sewall Wright - "mathematical tool for drawing __causal__ conclusions from a combination of observational data and __theoretical assumptions__"

Waves:
 1. Causal modelling through path models 
 2. Latent structures - factor analysis 
 3. Structural causal models 
 
SEM is a general modelling framework that is composed of a measurement model and a structural model.

???
Judea Pearl - [The Causal Foundations of Structural Equation Modeling](https://ftp.cs.ucla.edu/pub/stat_ser/r370.pdf)

Measurement model focuses on the estimation of latent or composite variables 
Structural model focuses on the estimation of relations between manifest and/or latent variables in the model (path model)

Terminology:

Manifest variables: observed/collected variables 
Latent variables: inferred measures - hypothetical constructs 
 - Indicator variables: measures used to infer the latent concepts

Endogenous variables: dependent outcomes 
Exogenous variables: predictors

Focus on covariance structure instead of mean 
---
## How do multiple causes combine to influence an outcome?
.pull-left[
Decompose correlations into causal pathways

H - heritability components  
G - genetic pathways  
E, D - environment  
Chance - random variation

To compute an effect:
1. Trace a path from X -> Y
2. Multiply coefficients along the path
3. Sum across all valid paths

If interested in the effect of Heritability on the Outcome:
H" -> G (b) __*__ G -> H (a) __*__ H -> O (h) __+__ H" -> G" (b) __*__ G" -> H (a)
]

.pull-right[
<img src="Wright.png" width="90%">
]
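
The tracing rules can be sketched in a few lines of R. The coefficients below are made-up values for illustration (they are not read off Wright's diagram): each path is a vector of coefficients, which are multiplied along the path and then summed across paths.

``` r
# Hypothetical standardised path coefficients (illustration only)
paths <- list(
  via_mediator = c(0.5, 0.4),  # X -> M (0.5), then M -> Y (0.4)
  direct       = c(0.2)        # X -> Y (0.2)
)

# Step 2: multiply coefficients along each path
path_products <- sapply(paths, prod)

# Step 3: sum across all valid paths
total_effect <- sum(path_products)

path_products
total_effect
```
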
---
## Structural part of the model (path analysis)

A model that tests relationships among a set of variables, often arranged in some structural form.

Path modelling = multiple regressions + causal structure 
- System of regressions/equations 
- Variables can be both predictors and outcomes simultaneously 
- You are estimating direct and indirect effects

???
.center[
<img src="graphical.png", width = "120%"> 
]
---

## First step: Specification of the model

Early childhood experiences influence cognitive development. Children who engage more frequently in structured activities (e.g., reading or drawing) tend to develop stronger cognitive abilities. One explanation is that these activities increase social engagement (e.g., interaction with caregivers), which supports learning.

In addition, structured activities may also improve attention and self-regulation, which further enhances cognitive development. Nutrition also supports brain development and contributes directly to cognitive abilities, but its effects can also be mediated through improved attention.

Researchers therefore hypothesise that structured activities and nutrition influence cognitive abilities both directly and indirectly through multiple pathways, including social engagement and attention.

???
Representation of our hypothetical assumptions in the form of the structural equation model
---

## First step: Specification of the model

.center[
<img src="GeneralExample.png", width = "80%"> 
]

---
## Can model be estimated?

Total number of parameters that we can estimate (unique variances and covariances): `\(\frac{p(p+1)}{2}\)`, where `\(p\)` is the number of observed variables  
Number of moments = `\(\frac{5 \times 6}{2}\)` = 15

.center[
<img src="GeneralExample.png", width = "70%"> 
]
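
A minimal sketch of the counting rule in base R. The helper name `n_moments` is introduced here for illustration; the second line checks the same count by counting the cells in the lower triangle (including the diagonal) of a 5 x 5 matrix.

``` r
# Number of unique variances and covariances for p observed variables
n_moments <- function(p) p * (p + 1) / 2

p <- 5  # Nutrition, StructuredAct, Attention, SocialEngagement, CognitiveAb
n_moments(p)

# Same count from the lower triangle (diagonal included) of a p x p matrix
sum(lower.tri(diag(p), diag = TRUE))
```
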

---
## Number of observations

``` r
Matrix<-cov(Babies[,c('Nutrition','StructuredAct','Attention','SocialEngagement','CognitiveAb')])
Matrix[upper.tri(Matrix)]<-NA
knitr::kable(Matrix, format = 'html')
```

<table>
 <thead>
 <tr>
 <th style="text-align:left;"> </th>
 <th style="text-align:right;"> Nutrition </th>
 <th style="text-align:right;"> StructuredAct </th>
 <th style="text-align:right;"> Attention </th>
 <th style="text-align:right;"> SocialEngagement </th>
 <th style="text-align:right;"> CognitiveAb </th>
 </tr>
 </thead>
<tbody>
 <tr>
 <td style="text-align:left;"> Nutrition </td>
 <td style="text-align:right;"> 29.36708 </td>
 <td style="text-align:right;"> NA </td>
 <td style="text-align:right;"> NA </td>
 <td style="text-align:right;"> NA </td>
 <td style="text-align:right;"> NA </td>
 </tr>
 <tr>
 <td style="text-align:left;"> StructuredAct </td>
 <td style="text-align:right;"> -13.15898 </td>
 <td style="text-align:right;"> 2524.05387 </td>
 <td style="text-align:right;"> NA </td>
 <td style="text-align:right;"> NA </td>
 <td style="text-align:right;"> NA </td>
 </tr>
 <tr>
 <td style="text-align:left;"> Attention </td>
 <td style="text-align:right;"> 74.13537 </td>
 <td style="text-align:right;"> -26.31428 </td>
 <td style="text-align:right;"> 248.08556 </td>
 <td style="text-align:right;"> NA </td>
 <td style="text-align:right;"> NA </td>
 </tr>
 <tr>
 <td style="text-align:left;"> SocialEngagement </td>
 <td style="text-align:right;"> -17.27102 </td>
 <td style="text-align:right;"> 1583.62222 </td>
 <td style="text-align:right;"> -56.02618 </td>
 <td style="text-align:right;"> 1219.1171 </td>
 <td style="text-align:right;"> NA </td>
 </tr>
 <tr>
 <td style="text-align:left;"> CognitiveAb </td>
 <td style="text-align:right;"> 148.00214 </td>
 <td style="text-align:right;"> 1240.22693 </td>
 <td style="text-align:right;"> 456.96534 </td>
 <td style="text-align:right;"> 858.7603 </td>
 <td style="text-align:right;"> 1726.013 </td>
 </tr>
</tbody>
</table>
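
A small check that links this table to the model output later in the lecture. `cov()` divides by N - 1, whereas lavaan's default ML estimator works (as far as I am aware) with the N divisor, so the values differ slightly. The sketch below rescales three entries copied from the table, assuming N = 100 as reported by lavaan.

``` r
N <- 100  # number of observations reported by lavaan for these data

# Values copied from the table above (cov() divides by N - 1)
var_nutrition     <- 29.36708
var_structuredact <- 2524.05387
cov_nutr_struct   <- -13.15898

# Rescale to the N divisor
rescaled <- c(var_nutrition, var_structuredact, cov_nutr_struct) * (N - 1) / N
round(rescaled, 3)
```

The rescaled values can be compared with the Nutrition and StructuredAct rows in the Variances and Covariances sections of the lavaan output.
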

---
## How many parameters are we estimating?

.center[
<img src="ModelParameters.png", width = "70%"> 
]

Parameters that we are estimating = variances of exogenous variables (V1, V2) + covariances (2) + regression pathways (a1, a2, b1, b2, c1, c2) + residual variances (E1, E2, E3) = 13
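
The same bookkeeping as a short sketch in R. The labels follow the diagram; the vector name `n_free` is only used here for illustration. Subtracting the free parameters from the 15 available moments gives the degrees of freedom.

``` r
n_moments <- 5 * (5 + 1) / 2  # 15 unique variances and covariances

n_free <- c(
  exogenous_variances = 2,  # V1, V2
  covariances         = 2,  # Nutrition ~~ StructuredAct, Attention ~~ SocialEngagement
  regression_paths    = 6,  # a1, a2, b1, b2, c1, c2
  residual_variances  = 3   # E1, E2, E3
)

sum(n_free)              # free parameters
n_moments - sum(n_free)  # degrees of freedom
```
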
---

## Second step: model identification

1. Under-identified: more free parameters than total possible parameters (negative degrees of freedom) 
2. Just-identified: equal number of free parameters and total possible parameters (zero degrees of freedom) 
3. Over-identified: fewer free parameters than total possible parameters (positive degrees of freedom) 
 
Parameters can either be: free, fixed or constrained
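
The three cases can be sketched as a small helper. The function name `identification_status` is hypothetical, and the counting rule it implements is only a necessary condition: a model with non-negative degrees of freedom can still fail to be identified for other reasons.

``` r
# Classify a model by comparing available moments with free parameters
identification_status <- function(n_moments, n_free) {
  df <- n_moments - n_free
  if (df < 0) {
    "under-identified"
  } else if (df == 0) {
    "just-identified"
  } else {
    "over-identified"
  }
}

identification_status(15, 13)  # the model from this lecture
identification_status(15, 15)  # as many free parameters as moments
identification_status(15, 16)  # more free parameters than moments
```
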

---

## Third step: estimation of the model


``` r
modelAbility <- '
 Attention ~ Nutrition
 SocialEngagement ~ StructuredAct
 CognitiveAb ~ Nutrition + StructuredAct + Attention + SocialEngagement
 Attention ~~ SocialEngagement
 Nutrition ~~ StructuredAct
'
```
--

``` r
library(lavaan)  # provides sem()
fit1<-sem(modelAbility, data=Babies)
summary(fit1)
```

---

## Step four: model evaluation

Chi-square test: measure of how well the model-implied covariance matrix fits the sample covariance matrix

We would prefer not to reject the null hypothesis in this case

Assumptions: 
Multivariate normality 
N is sufficiently large (150+) 
Parameters are not at boundary or invalid (e.g. variance of zero)

With large samples it is sensitive to small misfits 
Non-normality induces bias 
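
As a rough illustration, the test can be reproduced by hand from the log-likelihoods that lavaan reports for `fit1` (values copied from the `summary()` output shown later in the lecture). This is a sketch of the likelihood-ratio logic, not lavaan's internal code.

``` r
# Values reported by lavaan for fit1
loglik_h0 <- -1970.332  # user model
loglik_h1 <- -1969.699  # unrestricted (saturated) model
df        <- 2

# Likelihood-ratio statistic: twice the difference in log-likelihoods
chisq <- 2 * (loglik_h1 - loglik_h0)
chisq

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)
round(p_value, 3)
```

A large p-value means that the model-implied covariance matrix is not detectably different from the observed one, which is why we prefer not to reject the null hypothesis here.
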
---

## Other fit indices

``` r
summary(fit1, fit.measures=TRUE)
```

```
## lavaan 0.6-21 ended normally after 43 iterations
## 
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 13
## 
## Number of observations 100
## 
## Model Test User Model:
## 
## Test statistic 1.266
## Degrees of freedom 2
## P-value (Chi-square) 0.531
## 
## Model Test Baseline Model:
## 
## Test statistic 603.188
## Degrees of freedom 10
## P-value 0.000
## 
## User Model versus Baseline Model:
## 
## Comparative Fit Index (CFI) 1.000
## Tucker-Lewis Index (TLI) 1.006
## 
## Loglikelihood and Information Criteria:
## 
## Loglikelihood user model (H0) -1970.332
## Loglikelihood unrestricted model (H1) -1969.699
## 
## Akaike (AIC) 3966.663
## Bayesian (BIC) 4000.530
## Sample-size adjusted Bayesian (SABIC) 3959.473
## 
## Root Mean Square Error of Approximation:
## 
## RMSEA 0.000
## 90 Percent confidence interval - lower 0.000
## 90 Percent confidence interval - upper 0.173
## P-value H_0: RMSEA <= 0.050 0.608
## P-value H_0: RMSEA >= 0.080 0.295
## 
## Standardized Root Mean Square Residual:
## 
## SRMR 0.020
## 
## Parameter Estimates:
## 
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
## 
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## Attention ~ 
## Nutrition 2.501 0.143 17.547 0.000
## SocialEngagement ~ 
## StructuredAct 0.628 0.030 21.237 0.000
## CognitiveAb ~ 
## Nutrition 1.699 0.362 4.688 0.000
## StructuredAct 0.084 0.045 1.853 0.064
## Attention 1.498 0.126 11.905 0.000
## SocialEngagmnt 0.688 0.065 10.514 0.000
## 
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## .Attention ~~ 
## .SocialEngagmnt -16.802 11.728 -1.433 0.152
## Nutrition ~~ 
## StructuredAct -13.027 26.985 -0.483 0.629
## 
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .Attention 60.342 8.534 7.071 0.000
## .SocialEngagmnt 223.279 31.576 7.071 0.000
## .CognitiveAb 93.590 13.236 7.071 0.000
## Nutrition 29.073 4.112 7.071 0.000
## StructuredAct 2498.813 353.386 7.071 0.000
```

---

## Other fit indices
.center[
<img src="fitInd.png", width = "60%">
]

???
TLI: fit of .95 indicates that the fitted model improves the fit by 95% relative to the null model; works OK with smaller sample sizes 
CFI: Same as TLI, but not very sensitive to sample size 
RMSEA: based on the discrepancy between the sample covariance matrix and the covariance matrix implied by the hypothesised model. If variables are on different scales the residuals are hard to interpret, in which case we can check the standardised root mean square residual (SRMR) 
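
A minimal sketch of how three of these indices follow from the chi-square values in the lavaan output for `fit1`. The formulas are the commonly cited textbook versions; some sources use N - 1 rather than N in the RMSEA denominator, so treat the exact scaling as an assumption.

``` r
# Values reported by lavaan for fit1
chisq_m <- 1.266;   df_m <- 2    # user model
chisq_b <- 603.188; df_b <- 10   # baseline (null) model
N <- 100

# CFI: 1 minus the ratio of model to baseline non-centrality
cfi <- 1 - max(chisq_m - df_m, 0) / max(chisq_b - df_b, chisq_m - df_m, 0)

# TLI: compares chi-square/df ratios; it is not bounded at 1
tli <- (chisq_b / df_b - chisq_m / df_m) / (chisq_b / df_b - 1)

# RMSEA: misfit per degree of freedom, scaled by sample size
rmsea <- sqrt(max(chisq_m - df_m, 0) / (df_m * N))

round(c(CFI = cfi, TLI = tli, RMSEA = rmsea), 3)
```

Because the model chi-square is smaller than its degrees of freedom, CFI is capped at 1 and RMSEA at 0, while TLI goes slightly above 1.
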
---

## Direct and indirect

.center[
<img src="simplified.png", width = "60%"> 
]

Direct effect (c): subgroups/cases that differ by one unit on X, but are equal on M are estimated to differ by __c__ units on Y.

Indirect effect: 
 a) X -> M: cases that differ by one unit in X are estimated to differ by __a__ units on M 
 b) M -> Y: cases that differ by one unit in M, but are equal on X, are estimated to differ by __b__ units on Y 
The indirect effect of X on Y through M is a product of __a__ and __b__. The two cases that differ by one unit on X are estimated to differ by __ab__ units on Y as a result of the effect of X on M which affects Y.
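
The decomposition can be checked on simulated data with ordinary regressions. Everything below (sample size, seed, population coefficients) is made up for illustration; with least-squares fits and a single mediator, the total effect equals the direct effect plus __ab__.

``` r
set.seed(1)
n <- 200

# Simulated data with made-up population coefficients
X <- rnorm(n)
M <- 0.5 * X + rnorm(n)            # a = 0.5
Y <- 0.3 * X + 0.4 * M + rnorm(n)  # c = 0.3, b = 0.4

a <- coef(lm(M ~ X))["X"]          # X -> M
fit_y <- lm(Y ~ X + M)
c_direct <- coef(fit_y)["X"]       # X -> Y, holding M constant
b <- coef(fit_y)["M"]              # M -> Y, holding X constant

indirect <- a * b
total    <- coef(lm(Y ~ X))["X"]   # X -> Y, ignoring M

# Total effect compared with direct + indirect
all.equal(unname(total), unname(c_direct + indirect))
```
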

---
## Direct and indirect

``` r
modelAbility <- '
 Attention ~ a1*Nutrition
 SocialEngagement ~ a2*StructuredAct
 CognitiveAb ~ c1*Nutrition + c2*StructuredAct + b1*Attention + b2*SocialEngagement
 Attention ~~ SocialEngagement
 Nutrition ~~ StructuredAct

indirectAtt := a1*b1
 directAtt := c1
 totalAtt := indirectAtt+directAtt
 
 indirectAct := a2*b2
 directAct := c2
 totalAct := indirectAct+directAct
'
fitPath<-sem(modelAbility, data=Babies)
summary(fitPath)
```

```
## lavaan 0.6-21 ended normally after 43 iterations
## 
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 13
## 
## Number of observations 100
## 
## Model Test User Model:
## 
## Test statistic 1.266
## Degrees of freedom 2
## P-value (Chi-square) 0.531
## 
## Parameter Estimates:
## 
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
## 
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## Attention ~ 
## Nutrition (a1) 2.501 0.143 17.547 0.000
## SocialEngagement ~ 
## StrctrdAc (a2) 0.628 0.030 21.237 0.000
## CognitiveAb ~ 
## Nutrition (c1) 1.699 0.362 4.688 0.000
## StrctrdAc (c2) 0.084 0.045 1.853 0.064
## Attention (b1) 1.498 0.126 11.905 0.000
## SclEnggmn (b2) 0.688 0.065 10.514 0.000
## 
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## .Attention ~~ 
## .SocialEngagmnt -16.802 11.728 -1.433 0.152
## Nutrition ~~ 
## StructuredAct -13.027 26.985 -0.483 0.629
## 
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .Attention 60.342 8.534 7.071 0.000
## .SocialEngagmnt 223.279 31.576 7.071 0.000
## .CognitiveAb 93.590 13.236 7.071 0.000
## Nutrition 29.073 4.112 7.071 0.000
## StructuredAct 2498.813 353.386 7.071 0.000
## 
## Defined Parameters:
## Estimate Std.Err z-value P(>|z|)
## indirectAtt 3.748 0.380 9.852 0.000
## directAtt 1.699 0.362 4.688 0.000
## totalAtt 5.447 0.279 19.518 0.000
## indirectAct 0.432 0.046 9.423 0.000
## directAct 0.084 0.045 1.853 0.064
## totalAct 0.516 0.028 18.377 0.000
```
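
The Defined Parameters rows can be approximated by hand from the rounded estimates above. The standard error below uses the first-order delta method (Sobel) and assumes the estimates of a1 and b1 are uncorrelated, so it is a sketch rather than a reproduction of lavaan's computation.

``` r
# Estimates and standard errors copied from the output above
a1 <- 2.501; se_a1 <- 0.143   # Attention ~ Nutrition
b1 <- 1.498; se_b1 <- 0.126   # CognitiveAb ~ Attention
c1 <- 1.699                   # CognitiveAb ~ Nutrition

indirect <- a1 * b1
total    <- indirect + c1

# First-order delta-method (Sobel) standard error,
# assuming the estimates of a1 and b1 are uncorrelated
se_indirect <- sqrt(a1^2 * se_b1^2 + b1^2 * se_a1^2)

round(c(indirect = indirect, total = total,
        se = se_indirect, z = indirect / se_indirect), 2)
```

Small differences from the reported values (3.748, 5.447, 0.380, 9.852) are expected, because the inputs here are rounded to three decimals.
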
???
An interaction between predictors can be included, as in a linear regression model, by using the `:` operator.

modelAbilityInteraction<- 
SocialBeh~Nutrition+PhyExer+GMA+__PhyExer:GMA__ 
CognitiveAb~SocialBeh+Nutrition+GMA

---

## Model modification

Add/take out theoretical pathways:

``` r
modelAbility2 <- '
 SocialEngagement ~ StructuredAct
 CognitiveAb ~ Nutrition + StructuredAct + Attention + SocialEngagement
 Attention ~~ SocialEngagement
 Nutrition ~~ StructuredAct
'

fit2<-sem(modelAbility2, data=Babies)

lavTestLRT(fit1,fit2)
```

```
## 
## Chi-Squared Difference Test
## 
## Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq) 
## fit1 2 3966.7 4000.5 1.2659 
## fit2 3 4104.3 4135.6 140.9194 139.65 1.1775 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
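
The chi-square difference test can be reproduced from the table with base R. This is a sketch using the rounded values printed above.

``` r
# Chi-square statistics and degrees of freedom from the table above
chisq_fit1 <- 1.2659;   df_fit1 <- 2
chisq_fit2 <- 140.9194; df_fit2 <- 3

chisq_diff <- chisq_fit2 - chisq_fit1
df_diff    <- df_fit2 - df_fit1

# p-value for the nested-model comparison
p_diff <- pchisq(chisq_diff, df = df_diff, lower.tail = FALSE)

round(chisq_diff, 2)
p_diff < 2.2e-16
```

Removing the `Attention ~ Nutrition` path frees one degree of freedom but makes the fit significantly worse, so the more complex model (`fit1`) is preferred.
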

---

## Or check modification indices

``` r
modindices(fit2, sort=TRUE)
```

```
##                 lhs op              rhs     mi    epc sepc.lv sepc.all sepc.nox
## 31        Nutrition  ~        Attention 75.335  0.298   0.298    0.867    0.867
## 35        Attention  ~        Nutrition 74.330  2.471   2.471    0.850    0.850
## 20        Nutrition ~~        Attention 74.226 71.693  71.693    0.848    0.848
## 29        Nutrition  ~      CognitiveAb 64.239  0.180   0.180    1.245    1.245
## 33        Attention  ~      CognitiveAb  7.058  0.155   0.155    0.368    0.368
## 28        Nutrition  ~ SocialEngagement  1.230 -0.040  -0.040   -0.257   -0.257
## 21 SocialEngagement  ~      CognitiveAb  0.121  0.056   0.056    0.060    0.060
## 22 SocialEngagement  ~        Nutrition  0.121  0.095   0.095    0.015    0.015
## 15 SocialEngagement ~~        Nutrition  0.121  2.757   2.757    0.034    0.034
## 34        Attention  ~    StructuredAct  0.111 -0.010  -0.010   -0.033   -0.033
## 32        Attention  ~ SocialEngagement  0.111 -0.017  -0.017   -0.037   -0.037
## 25    StructuredAct  ~      CognitiveAb  0.009  0.022   0.022    0.016    0.016
## 27    StructuredAct  ~        Attention  0.008  0.028   0.028    0.009    0.009
## 24    StructuredAct  ~ SocialEngagement  0.000  0.022   0.022    0.015    0.015
```

---

## Prerequisites

Theory: strong theoretical assumptions from which causal hypotheses can be derived, specified as a model, and tested against the data

Data: large samples, N:p rule - 20:1; more data usually gives better estimates. 
 - We are not that interested in significance: 
 a) The overall behaviour of the model is more interesting 
 b) More data means a higher probability of significant results (even for weak effects) 
 c) Latent models are estimated by anchoring on indicator variables; different estimation choices can result in different patterns

---
## Problems with SEM and alternatives

1. Variables derived from the normal distribution 
2. Observations independent 
3. Large sample size

---
## PiecewiseSEM
.center[
<img src="PiecewiseSEM.png", width = "50%"> 
]

Variables are causally dependent if there is an arrow between them 
They are causally independent if there are no arrows between them 
 
X1 is causally independent from Y2 _conditional_ on Y1

PiecewiseSEM performs a test of directed separation (d-sep): it asks whether paths that the model claims are causally independent are significant when controlling for the variables on which the causal process is conditional.

???
https://jonlefcheck.net/2014/07/06/piecewise-structural-equation-modeling-in-ecological-research/

---

## PiecewiseSEM

``` r
#install.packages('piecewiseSEM')
library(piecewiseSEM)
model1<-psem(lm(SocialEngagement~Nutrition, data=Babies),
 lm(CognitiveAb~SocialEngagement+Nutrition+StructuredAct+Attention, data=Babies))
summary(model1, .progressBar=FALSE)
```

```
## 
## Structural Equation Model of model1 
## 
## Call:
##   SocialEngagement ~ Nutrition
##   CognitiveAb ~ SocialEngagement + Nutrition + StructuredAct + Attention
## 
##     AIC
##  1748.214
## 
## ---
## Tests of directed separation:
## 
##                           Independ.Claim Test.Type DF Crit.Value P.Value    
##   SocialEngagement ~ StructuredAct + ...      coef 97    20.7229  0.0000 ***
##       SocialEngagement ~ Attention + ...      coef 97    -0.4514  0.6527    
## 
## --
## Global goodness-of-fit:
## 
## Chi-Squared = 171.231 with P-value = 0 and on 2 degrees of freedom
## Fisher's C = 169.754 with P-value = 0 and on 4 degrees of freedom
## 
## ---
## Coefficients:
## 
##           Response        Predictor Estimate Std.Error DF Crit.Value P.Value
##   SocialEngagement        Nutrition  -0.5881    0.6481 98    -0.9074  0.3664
##        CognitiveAb SocialEngagement   0.6880    0.0675 95    10.1863  0.0000
##        CognitiveAb        Nutrition   1.6992    0.3726 95     4.5601  0.0000
##        CognitiveAb    StructuredAct   0.0842    0.0468 95     1.8010  0.0749
##        CognitiveAb        Attention   1.4985    0.1292 95    11.6024  0.0000
##   Std.Estimate    
##        -0.0913    
##         0.5782 ***
##         0.2216 ***
##         0.1018    
##         0.5681 ***
## 
##   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05
## 
## ---
## Individual R-squared:
## 
##           Response method R.squared
##   SocialEngagement   none      0.01
##        CognitiveAb   none      0.95
```
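
Fisher's C in the output combines the p-values of the independence claims: C = -2 times the sum of their logs, compared against a chi-square distribution with 2k degrees of freedom for k claims. The sketch below uses a hypothetical helper `fisher_c` and made-up p-values, since the first p-value in the output is only printed as 0.0000.

``` r
# Fisher's C from the p-values of k independence claims
fisher_c <- function(p_values) {
  C  <- -2 * sum(log(p_values))
  df <- 2 * length(p_values)
  c(C = C, df = df, p = pchisq(C, df = df, lower.tail = FALSE))
}

# Made-up p-values for two independence claims (illustration only)
fisher_c(c(0.04, 0.65))
```

A small p-value for C indicates that at least one missing path is needed, as with the omitted `SocialEngagement ~ StructuredAct` path here.
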

---

## Important aspects: theory

- Difference between moderation and mediation 
- Interpretation of the predictors 
- Calculation of free parameters and total parameters 
- Model identification: three types of identification 
- Overall fit of the model

---

## Important aspects: practice

- Building a path model: both continuous and categorical exogenous variables 
- Calculation of the direct and indirect pathways for predictors of interest 
- Adding an interaction to path model 
- Interpretation of the coefficients 
- Getting fit indices of the model

---
## Literature

Chapters 1 to 5 of Principles and Practice of Structural Equation Modeling by Rex B. Kline

Introduction to Mediation, Moderation, and Conditional Process Analysis: A Regression-Based Approach by Andrew F. Hayes

Latent Variable Modeling Using R: A Step-by-Step Guide by A. Alexander Beaujean

---
## Practical part

RW3D ("Real World Worry Waves Dataset") is a multi-modal longitudinal dataset collected in the UK to study people's emotional and psychological responses to the COVID-19 pandemic over time. [Paper](https://www.nature.com/articles/s41597-023-02438-y?utm_source=chatgpt.com#Sec25)

Worry about Covid -> Anxiety -> Worry about mental health

Worry about Covid -> Fear -> Worry about mental health

[Data](https://osf.io/9b85r/)

---

# Thank you for your attention