Mathematical functions that describe the probabilities of occurrence of different possible outcomes: link. For example, the binomial distribution is a discrete probability distribution (1 or 0), so it has a countable number of outcomes (in contrast, a continuous probability distribution, such as the Gaussian, can take any value in a specific range of values). The binomial distribution has two parameters and we can write it as \(X \sim B(n,p)\), where n is the number of trials and p the probability of success in each trial. The Gaussian distribution can also be described with two parameters, \(X \sim N(\mu, \sigma)\), where \(\mu\) is the arithmetic mean and \(\sigma\) the dispersion around the mean, or standard deviation.
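A quick way to build intuition for the two distributions is to sample from them; a minimal sketch in R (the parameter values are arbitrary):
set.seed(42)  # for reproducibility
rbinom(n = 10, size = 20, prob = 0.5)  # 10 draws from B(20, 0.5) - counts of successes, always whole numbers
rnorm(n = 10, mean = 0, sd = 1)  # 10 draws from N(0, 1) - can take any real value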
Relates to the concept of “moments” in physics (e.g. moment of inertia). Moments quantify the shape of a distribution: its location (mean), its scale (spread), and its overall geometry. Check this great blog by Gregory Gundersen.
Mean (\(\mu\)) - measure of central tendency
Variance (\(\sigma^2\)) - measure of dispersion around the mean. Expected value of the squared deviation from the mean: \(Var(X)=E[(X-\mu)^2]\). You can look at it as the covariance of a variable with itself - \(Cov(X,X)\)
Standard deviation (\(\sigma\)) - measure of dispersion around the mean. Square root of the variance
Covariance \(cov(X,Y)\) - measure of the joint variability of two variables: \(Cov(X,Y)=E[(X-\mu_X)(Y-\mu_Y)]\). See the short R example after this list.
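All four quantities are built into base R; a minimal sketch with simulated data (the vectors x and y are hypothetical):
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)  # hypothetical variable
y <- x + rnorm(100)  # second variable, constructed to covary with x
mean(x)  # central tendency
var(x)  # dispersion around the mean
sd(x)  # square root of the variance: sqrt(var(x))
cov(x, y)  # joint variability of x and y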
Equation that summarises how the average value of the dependent variable varies over values defined by a linear function of the predictor(s).
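In R this corresponds to the lm() function; a minimal sketch using the Babies data that appears throughout this page:
m1 <- lm(Height ~ Age, data = Babies)  # Height as a linear function of Age
coef(m1)  # intercept and slope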
Measure of the degree of association (correlation) between two variables, with the influence/effect of other variables (e.g. X2) controlled for in both variables (e.g. Y and X1).
For example, when we want to control for the effect of Weight when estimating the correlation between Age and Height of babies.
Just a simple correlation would be:
cor(Babies$Age, Babies$Height)
## [1] 0.21957
The partial correlation would be:
require(ppcor)
pcor(Babies[,1:3])$estimate
##                Age      Weight    Height
## Age     1.00000000 -0.04780643 0.2206629
## Weight -0.04780643  1.00000000 0.3885231
## Height  0.22066292  0.38852308 1.0000000
We can calculate this in steps:
AgeRes = residuals(lm(Age ~ Weight, data = Babies))  # we take residuals of Age when modelled by Weight (take out the part of the variance in Age explained by Weight)
HeightRes = residuals(lm(Height ~ Weight, data = Babies))  # we take residuals of Height when modelled by Weight (take out the part of the variance in Height explained by Weight)
cor(HeightRes, AgeRes)  # we correlate the two sets of residuals
## [1] 0.2206629
Different ways to transform labels (categorical values) into numbers (1s and 0s) so that we can use them in statistical models. Check different ways of dummy coding in R: link
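Base R ships with several contrast schemes; a minimal sketch for a three-level factor:
contr.treatment(3)  # treatment (dummy) coding - first level is the reference
##   2 3
## 1 0 0
## 2 1 0
## 3 0 1
contr.sum(3)  # sum (effect) coding - levels compared against the grand mean
##   [,1] [,2]
## 1    1    0
## 2    0    1
## 3   -1   -1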
Link: The error of an observed value is the deviation of the value from the true (unknown) quantity of interest (e.g. the population mean). The residual is the difference between the observed value and the estimated value of the quantity of interest (e.g. the sample mean).
Measure of model fit: the standard deviation of the residuals in a regression model, \(\sqrt{\frac{SS_{residual}}{df}}\)
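We can verify the formula against R's built-in value; a minimal sketch, assuming the Height ~ Age model used on this page:
m1 <- lm(Height ~ Age, data = Babies)
sqrt(sum(residuals(m1)^2) / df.residual(m1))  # residual standard error by hand
sigma(m1)  # R's built-in residual standard error - should match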
Proportion of the variation in the dependent variable explained by the independent variable(s): link. \(R^2 = 1 - \frac{SS_{residual}}{SS_{total}}\)
\(SS_{residual}\) - we need to calculate the distances between the predicted and observed values (the vertical distances from the regression line), square them, and sum them up.
## 
## Call:
## lm(formula = Height ~ Age, data = Babies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4765  -4.1601  -0.3703   3.9198  12.3842 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 57.02580    1.18751  48.021   <2e-16 ***
## Age          0.14317    0.06426   2.228   0.0282 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.283 on 98 degrees of freedom
## Multiple R-squared:  0.04821, Adjusted R-squared:  0.0385 
## F-statistic: 4.964 on 1 and 98 DF,  p-value: 0.02817
\(SS_{total}\): to get the total amount of variance, we calculate the distances from the mean of the dependent variable, square them, and sum them up.
## 
## Call:
## lm(formula = Height ~ 1, data = Babies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.4165  -4.2284  -0.2062   3.6744  13.5940 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59.3953     0.5388   110.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.388 on 99 degrees of freedom
1 - (sum(Babies$SSresid^2) / sum(Babies$SStotal^2))
## [1] 0.04821098
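The SSresid and SStotal columns are not constructed on this page; a plausible sketch of how they could be built from the definitions above (my assumption, not necessarily the original code):
m1 <- lm(Height ~ Age, data = Babies)  # our model
m0 <- lm(Height ~ 1, data = Babies)  # null (intercept-only) model
Babies$SSresid <- Babies$Height - predict(m1)  # observed minus predicted values
Babies$SStotal <- Babies$Height - predict(m0)  # observed minus the mean of Height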
Similar to the partial correlation, but here the influence of the control (a third variable, X2) is held constant for only one of the two variables (only X1, not Y). This way we get how much unique variation X1 explains in Y after accounting for the variance already explained by X2. In other words, we get the unique contribution of each predictor to the explained variance, which can be used to compare the importance/contribution of predictors. Link
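The ppcor package used above also computes semi-partial correlations; a minimal sketch:
require(ppcor)
spcor(Babies[, 1:3])$estimate  # semi-partial (part) correlations for Age, Weight and Height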
Test that tells us whether our model is significant - in particular, whether it explains/predicts the dependent variable better than the null model, which in this case is the intercept-only model (just knowing the mean of the dependent variable).
\(F = \frac{SS_m/df_m}{SS_r/df_r}\)
We need to calculate the sum of squares for our model (\(SS_{model}\)) - the distances between the regression line proposed by our model and the null model (the intercept-only model), squared and summed up.
(sum(Babies$SSmodel^2) / 1) / (sum(Babies$SSresid^2) / 98)
## [1] 4.963995
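As with SSresid, the SSmodel column is not built on this page; a plausible sketch (my assumption), together with R's built-in F-test for comparison:
Babies$SSmodel <- predict(m1) - predict(m0)  # model predictions minus the grand mean
anova(m0, m1)  # the same model comparison using R's built-in F-test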
Modelling framework used to test relationships between variables. A SEM model is composed of two parts: a measurement model and a structural model. It is often used (and should be used) to test and evaluate multivariate causal relationships. Check more here: link
The maximum number of free parameters that we can estimate given a specific number of elements (observed variables):
\(\frac{variables*(variables+1)}{2}\)
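For example, with three observed variables we have \(\frac{3 \times 4}{2} = 6\) unique elements (three variances and three covariances), so at most six free parameters can be estimated.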
It depends on the underlying model - always check the degrees of freedom and compare them with the drawing of the model. For example:
Model coefficients can only be estimated if the model has:
1. The same number of total and estimated parameters (just-identified)
2. More total than estimated parameters (over-identified model)
Otherwise, the model will be under-identified.
Model that investigates the latent space of measures (the measurement model). In comparison to exploratory factor analysis, CFA is a theory-driven approach: we have strong assumptions and expectations about how many factors there are and how the individual measures load onto these latent factors.
\(y_1=\tau_1+\lambda_1*\eta+\epsilon_1\)
\(\tau\) - the item intercepts or means
\(\lambda\) - factor loadings - regression coefficients
\(\epsilon\) - error variances and covariances
\(\eta\) - the latent predictor of the items
\(\psi\) - factor variances and covariances
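A minimal CFA sketch in the lavaan package (the indicators y1-y3 and the data frame mydata are hypothetical):
library(lavaan)
model <- ' eta =~ y1 + y2 + y3 '  # one latent factor (eta) measured by three items
fit <- cfa(model, data = mydata)
summary(fit, fit.measures = TRUE, standardized = TRUE)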
A reflective measurement model is one in which the indicators of a construct are considered to be caused by that construct. A formative measurement model is one in which the measured variables are considered to be the cause of the latent variable.
We have different options; the most common ones are (a lavaan sketch follows the list):
1. Marker variable: a single factor loading constrained to 1
2. Standardized latent variables: setting the variance of the latent variable to 1 (Z-score)
3. Effects-coding: constraining the loadings on one latent variable to average 1.0, or their sum to equal the number of indicators
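The first two options map directly onto lavaan arguments; a sketch, assuming the CFA model above:
fit_marker <- cfa(model, data = mydata)  # default: first loading fixed to 1 (marker variable)
fit_stdlv <- cfa(model, data = mydata, std.lv = TRUE)  # latent variance fixed to 1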
A combination of the measurement and the structural model:
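A sketch of a full SEM in lavaan (all variable names hypothetical):
model <- '
  eta1 =~ y1 + y2 + y3  # measurement part: latent variables and their indicators
  eta2 =~ y4 + y5 + y6
  eta2 ~ eta1  # structural part: regression among the latent variables
'
fit <- sem(model, data = mydata)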
Measurement invariance or measurement equivalence is a statistical property of measurement indicating that the same construct is being measured across some specified groups (see the lavaan sketch after this list):
1. Configural invariance: model fitted for each group separately
2. Metric invariance: restriction on the factor loadings, while intercepts are allowed to vary
3. Scalar invariance: restriction on both the factor loadings and the intercepts
4. Strict invariance: restriction on the factor loadings, intercepts, and residual variances
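A sketch of these four steps in lavaan (the grouping variable "group" and the model are hypothetical):
configural <- cfa(model, data = mydata, group = "group")
metric <- cfa(model, data = mydata, group = "group", group.equal = "loadings")
scalar <- cfa(model, data = mydata, group = "group", group.equal = c("loadings", "intercepts"))
strict <- cfa(model, data = mydata, group = "group", group.equal = c("loadings", "intercepts", "residuals"))
anova(configural, metric, scalar, strict)  # likelihood-ratio tests between the nested models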