Non-alphabetical glossary for Advanced Statistical Methods for Psychologists (PSY6210) at the University of Sheffield

Probability distribution

Mathematical functions that describe the probabilities of occurrence of different possible outcomes: link. For example, the binomial distribution is a discrete probability distribution (each trial results in 1 or 0), so it has a countable number of outcomes (in contrast, a continuous probability distribution, such as the Gaussian, can take any value within a specific range). The binomial distribution has two parameters and we can write it as \(X \sim B(n,p)\), where n is the number of trials and p the probability of success on each trial. The Gaussian distribution can also be described with two parameters, \(X \sim N(\mu, \sigma)\), where \(\mu\) is the arithmetic mean and \(\sigma\) the dispersion around the mean, or standard deviation.
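
As a quick illustration, base R provides the corresponding probability mass/density functions:

dbinom(7, size = 10, prob = 0.5) # probability of exactly 7 successes in 10 trials with p = 0.5
## [1] 0.1171875
dnorm(0, mean = 0, sd = 1) # density of the standard normal at its mean
## [1] 0.3989423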

Statistical moments

Relates to the concept of “moments” in physics (e.g. moment of inertia). Moments quantify the shape of a distribution: its location (mean), scale (spread), and shape (e.g. skewness and kurtosis). Check this great blog by Gregory Gundersen.
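
A minimal base-R sketch of the first four moments (using simulated data, so the exact values will vary):

x <- rnorm(100, mean = 0, sd = 1)  # simulated variable
mean(x)                            # first moment: location
var(x)                             # second central moment: scale (spread)
mean((x - mean(x))^3)/sd(x)^3      # standardized third moment: skewness
mean((x - mean(x))^4)/sd(x)^4      # standardized fourth moment: kurtosis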

Arithmetic mean, variance, standard deviation, co-variance

Mean (\(\mu\)) - measure of central tendency.
Variance (\(\sigma^2\)) - measure of dispersion around the mean; the expected value of the squared deviation from the mean: \(Var(X)=E[(X-\mu)^2]\). You can look at it as the covariance of a variable with itself - \(Cov(X,X)\).
Standard deviation (\(\sigma\)) - measure of dispersion around the mean; the square root of the variance.

Covariance (\(Cov(X,Y)\)) - measure of the joint variability of two variables: \(Cov(X,Y)=E[(X-\mu_X)(Y-\mu_Y)]\).
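
A short sketch of these relations, assuming the Babies data used in the examples below:

var(Babies$Age)                 # variance of Age
sd(Babies$Age)^2                # squared standard deviation equals the variance
cov(Babies$Age, Babies$Age)     # covariance of a variable with itself also equals the variance
cov(Babies$Age, Babies$Height)  # joint variability of two variables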

Linear regression

Equation that summarises how the average value of the dependent variable varies with the values of a predictor, as defined by a linear function.
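
A minimal sketch, using the Babies data from the examples below:

m1 <- lm(Height ~ Age, data = Babies)  # average Height as a linear function of Age
coef(m1)                               # intercept and slope (change in average Height per unit of Age)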

Partial correlation

Measure of the degree of association (correlation) between two variables, with the influence/effect of other variables (e.g. X2) controlled for in both variables (e.g. Y and X1).

For example, when we want to control for the effect of Weight when estimating the correlation between Age and Height of Babies.

Just a simple correlation would be:

cor(Babies$Age, Babies$Height)
## [1] 0.21957

The partial correlation would be:

require(ppcor)
pcor(Babies[,1:3])$estimate
##                Age      Weight    Height
## Age     1.00000000 -0.04780643 0.2206629
## Weight -0.04780643  1.00000000 0.3885231
## Height  0.22066292  0.38852308 1.0000000

We can calculate this in steps:

AgeRes=residuals(lm(Age~Weight, data=Babies)) # we take residuals of Age when modeled by Weight (take out the part of variance in Age explained by Weight)
HeightRes=residuals(lm(Height~Weight, data=Babies)) # we take residuals of Height when predicted by Weight (take out the part of variance in Height explained by Weight)
cor(HeightRes, AgeRes) # We calculate correlation
## [1] 0.2206629

Dummy coding of the categorical variables

Different ways in which we can transform labels (categorical values) into numbers (1s and 0s) so that they can be used in statistical models. Check different ways of dummy coding in R: link
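
A small sketch with a hypothetical three-level variable, showing R's default treatment (dummy) coding and one alternative:

condition <- factor(c('control', 'drugA', 'drugB'))  # hypothetical categorical variable
contrasts(condition)  # default treatment (dummy) coding against the first level
##         drugA drugB
## control     0     0
## drugA       1     0
## drugB       0     1
contr.sum(3)          # sum (deviation) coding as an alternative
##   [,1] [,2]
## 1    1    0
## 2    0    1
## 3   -1   -1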

Errors and residuals

Link: The error of an observed value is the deviation of that value from the true (unknown) quantity of interest (e.g. the population mean). The residual is the difference between the observed value and the estimated value of the quantity of interest (e.g. the sample mean).

Residual standard error

A measure of model fit: the standard deviation of the residuals in a regression model, \(\sqrt{\frac{SS_{residual}}{df}}\)
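
A sketch of the computation, assuming the Height ~ Age model shown in the output below:

m1 <- lm(Height ~ Age, data = Babies)
sqrt(sum(residuals(m1)^2)/df.residual(m1))  # matches the 'Residual standard error: 5.283 on 98 degrees of freedom' reported by summary(m1)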

Coefficient of determination (\(R^2\))

Proportion of the variation in the dependent variable explained by the independent variable(s): link. \(R^2 = 1 - \frac{SS_{residual}}{SS_{total}}\)

\(SS_{residual}\) - we calculate the distances between the predicted and observed values (the vertical distances from the regression line in a scatterplot), square them, and sum them up.

summary(lm(Height ~ Age, data = Babies))
## 
## Call:
## lm(formula = Height ~ Age, data = Babies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4765  -4.1601  -0.3703   3.9198  12.3842 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 57.02580    1.18751  48.021   <2e-16 ***
## Age          0.14317    0.06426   2.228   0.0282 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.283 on 98 degrees of freedom
## Multiple R-squared:  0.04821,    Adjusted R-squared:  0.0385 
## F-statistic: 4.964 on 1 and 98 DF,  p-value: 0.02817


\(SS_{total}\): To get the total amount of variance, we calculate the distances from the mean of the dependent variable (the intercept-only model below), square them, and sum them up.

summary(lm(Height ~ 1, data = Babies))
## 
## Call:
## lm(formula = Height ~ 1, data = Babies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.4165  -4.2284  -0.2062   3.6744  13.5940 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59.3953     0.5388   110.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.388 on 99 degrees of freedom

1-(sum(Babies$SSresid^2)/sum(Babies$SStotal^2)) # SSresid holds the residual distances, SStotal the distances from the mean
## [1] 0.04821098
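
Equivalently, a sketch that computes the same quantities directly from the fitted model:

m1 <- lm(Height ~ Age, data = Babies)
SSresid <- sum(residuals(m1)^2)                          # squared distances between observed and predicted values
SStotal <- sum((Babies$Height - mean(Babies$Height))^2)  # squared distances from the mean
1 - SSresid/SStotal
## [1] 0.04821098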

Semi-partial correlation

Similar to the partial correlation, but the influence of the control (third variable: X2) is held constant for only one of the two variables (only for X1 and not for Y). This tells us how much unique variance in Y is explained by X1, after accounting for the variance already explained by X2. In other words, we get the unique contribution of each predictor to the explained variance, which can be used to compare the importance/contribution of predictors. Link
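
The ppcor package also provides semi-partial correlations, analogous to the pcor example above:

require(ppcor)
spcor(Babies[,1:3])$estimate  # semi-partial correlations among Age, Weight, and Height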

F-statistic

Test of whether our model is significant; in particular, whether it explains/predicts the dependent variable better than the null model, which in this case is the intercept-only model (just knowing the mean of the dependent variable).

\(F = \frac{SS_m/df_m}{SS_r/df_r}\)

We need to calculate the sum of squares for our model (\(SS_{model}\)) - the distances between the predictions of the regression line proposed by our model and those of the null model (the intercept-only model), squared and summed up.

(sum(Babies$SSmodel^2)/1)/(sum(Babies$SSresid^2)/98) # df_m = 1 (one predictor), df_r = 98
## [1] 4.963995
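
The same F-statistic can be obtained from the fitted model via its ANOVA table:

m1 <- lm(Height ~ Age, data = Babies)
anova(m1)  # reproduces F = 4.964 on 1 and 98 df from the summary output above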

Structural equation modelling

Modelling framework used to test relationships between variables. The SEM model is composed of two parts: a measurement model and a structural model. It is often used (and should be used) to test and evaluate multivariate causal relationships. Check more here: link

Total number of parameters

The number of free parameters that we can estimate given a specific number of elements (observed variables):
\(\frac{variables \times (variables+1)}{2}\)
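
For example, with 3 observed variables we have \(\frac{3 \times (3+1)}{2} = 6\) pieces of information to work with: 3 variances and 3 covariances.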

Number of parameters estimated in the proposed model:

It depends on the underlying model - always check the degrees of freedom and compare them with the drawing of the model. For example:
(Figure: Model Parameters)

Model identification

Model coefficients can only be estimated if the model has:
1. The same number of total and estimated parameters (just-identified)
2. More total than estimated parameters (over-identified model)
Otherwise, the model is under-identified.
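
For example, a one-factor model with three indicators has \(\frac{3 \times 4}{2} = 6\) total parameters; with the marker-variable constraint (first loading fixed to 1) we estimate 2 loadings, 3 residual variances, and 1 factor variance, i.e. 6 parameters, so the model is just-identified (0 degrees of freedom).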

Fit Indices of SEM model

Common fit indices include the model chi-square test, CFI, TLI, RMSEA, and SRMR. (Figure: Fit Indices)

Confirmatory Factor Analysis (CFA)

Model that investigates the latent space of measures (measurement model). In contrast to exploratory factor analysis, CFA is a theory-driven approach: we have strong assumptions and expectations about how many factors there are and how the individual measures load onto these latent factors.
\(y_1=\tau_1+\lambda_1*\eta+\epsilon_1\)

\(\tau\) - the item intercepts or means
\(\lambda\) - factor loadings - regression coefficients
\(\epsilon\) - error variances and covariances
\(\eta\) - the latent predictor of the items
\(\psi\) - factor variances and covariances
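
A minimal CFA sketch in lavaan, assuming three hypothetical indicators y1-y3 of one latent factor eta in a data frame dat:

require(lavaan)
model <- 'eta =~ y1 + y2 + y3'     # measurement model: eta loads on three indicators
fit <- cfa(model, data = dat)
summary(fit, fit.measures = TRUE)  # factor loadings, residual variances, and fit indices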

Type of factors: reflective and formative

A reflective measurement model is one in which the indicators of a construct are considered to be caused by that construct. A formative measurement model is one in which the measured variables are considered to be the cause of the latent variable.

(Figure: Reflective vs Formative)

Defining the scale of latent variables

There are different options; the most common ones are:

1. Marker variable: a single factor loading constrained to 1

2. Standardized latent variables: setting the variance of the latent variable to 1 (Z-score metric)

3. Effects-coding: constraining the loadings on one latent variable (LV) so that they average 1.0, i.e. their sum equals the number of indicators
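
In lavaan (continuing the hypothetical CFA sketch above), the first two options look like this:

fit.marker <- cfa(model, data = dat)                 # default: first factor loading fixed to 1
fit.stand  <- cfa(model, data = dat, std.lv = TRUE)  # factor variance fixed to 1 instead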

Full SEM model

Combination of the measurement and structural models:

(Figure: Home advantage)
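
A sketch of a full SEM in lavaan, again with hypothetical indicators y1-y6 and latent variables eta1 and eta2:

require(lavaan)
model <- '
  eta1 =~ y1 + y2 + y3  # measurement model for eta1
  eta2 =~ y4 + y5 + y6  # measurement model for eta2
  eta2 ~ eta1           # structural model: eta1 predicts eta2
'
fit <- sem(model, data = dat)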

Measurement invariance

Measurement invariance or measurement equivalence is a statistical property of measurement that indicates that the same construct is being measured across some specified groups:

1. Configural invariance: the same model structure is fitted for each group, with no equality constraints across groups

2. Metric invariance: factor loadings are constrained to be equal across groups, but intercepts are allowed to vary

3. Scalar invariance: both factor loadings and intercepts are constrained to be equal across groups

4. Strict invariance: factor loadings, intercepts, and residual variances are all constrained to be equal across groups
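
In lavaan, these levels can be fitted with the group and group.equal arguments (a sketch, assuming a grouping variable 'group' in the hypothetical data dat from the CFA example above):

fit.configural <- cfa(model, data = dat, group = 'group')
fit.metric     <- cfa(model, data = dat, group = 'group', group.equal = 'loadings')
fit.scalar     <- cfa(model, data = dat, group = 'group', group.equal = c('loadings', 'intercepts'))
fit.strict     <- cfa(model, data = dat, group = 'group', group.equal = c('loadings', 'intercepts', 'residuals'))
anova(fit.configural, fit.metric, fit.scalar, fit.strict)  # compare the increasingly constrained models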