class: center
background-image: url("mainFinal2.png")

---

<style type="text/css">
body, td {
  font-size: 15px;
}
code.r{
  font-size: 15px;
}
pre {
  font-size: 20px
}
.huge .remark-code { /*Change made here*/
  font-size: 200% !important;
}
.tiny .remark-code { /*Change made here*/
  font-size: 80% !important;
}
</style>

## Press record

---

## Logistics

- What are we planning to talk about: https://nemanjavaci.netlify.app/advanced-stats-course/course-handbook/<br/> <br/>
a) Generalized linear models <br/>
b) Structural equation modelling: path models and confirmatory factor analysis <br/>
c) Mixed-effects models: cross-sectional and longitudinal data <br/> <br/>

- Theory and practice (1 + 1 hour):<br/><br/>
a) Theoretical aspect – why and when would we want to use a certain statistical model <br/>
b) Mathematical aspect – the mathematical basis of the model and how we can transform the data and parameters <br/>
c) Practical aspect – using R to analyse the data and build statistical models on real-world data <br/> <br/>

- R statistical environment <br/> <br/>

- Materials:<br/>
a) Presentations (press __p__ for additional content)<br/>
b) Commented R code ([link](https://nemanjavaci.netlify.app/advanced-stats-course/psy6210-r-code/) for this lecture)<br/>
c) Slack channel to discuss statistical questions<br/>
d) Course glossary: [link](https://nemanjavaci.netlify.app/advanced-stats-course/psy6210-glossary/)

???
General overview of the course<br/> <br/>
Focus on theory and applications, but also on how to use these models in the R environment<br/> <br/>
Each lecture is planned to be supported by an HTML presentation with the main slide content and lecturer's notes, plus commented code that produced the outputs in the presentation. Often, I will include additional content that you can explore, but it is not going to be used for any subsequent knowledge assessments.

---

## Knowledge assessment

#### The application of statistical methods to existing data

A series of research questions for which you will have to propose a statistical model that tests hypothetical assumptions, motivate your models, build them in the R statistical environment, and interpret the results (parameters, fit, and critique of the model).

Homework: 10% of the final mark <br/>
Final essay-type exam: 90% of the final mark

The exam will focus on: <br/>
a) Theoretical aspect <br/>
b) Mathematical aspect <br/>
c) Practical aspect

---

## Opportunities for feedback:

Discussions during the classes <br/>
Feedback on the homework <br/>
Slack channel and feedback form <br/>
Course communication: n.vaci@sheffield.ac.uk <br/>
Office hours: Friday from 1 to 2 pm (https://tinyurl.com/y2ptqmwx)<br/>
Post your questions on the queries spreadsheet

---

## Today: linear regression

Intended learning outcomes: <br/>

Explain and motivate the linear regression model and how it relates to other statistical methods <br/>

Build a model; analyse and interpret the coefficients (models with one predictor, categorical predictors, multiple predictors, and interactions) <br/>

Interpret the other information that linear regression reports: determination coefficient, F-test, residuals <br/>

Evaluate the fit and assumptions of the linear regression model <br/>

---

class: center, inverse
background-image: url("main.gif")

---

## Why do we use statistical models?
--

<img src="Distributions.jpeg" width="40%" style="display: block; margin: auto;" />

Full image: [link](https://miro.medium.com/max/1400/1*NhLlwFMzN0yWSvhipqMgfw.jpeg)

---

## Gaussian (normal) distribution

`$$y \sim \mathcal{N}(\mu,\sigma)$$`

<img src="normalD.png" width="90%" style="display: block; margin: auto;" />

???
The Gaussian distribution is described with two parameters: the mean (measure of central tendency) and the dispersion around the mean (standard deviation). Central limit theorem: [link](https://en.wikipedia.org/wiki/Central_limit_theorem)<br/><br/>
What can we do with one distribution: [Moments](https://en.wikipedia.org/wiki/Moment_(mathematics)), [Normality](https://en.wikipedia.org/wiki/Normality_test#:~:text=In%20statistics%2C%20normality%20tests%20are,set%20to%20be%20normally%20distributed.)<br/>
What can we do with two? With three or more? <br/>

---

## Linear regression

A method that summarises how the average values of a numerical outcome variable vary over values defined by linear functions of the predictors ([visualisation](https://seeing-theory.brown.edu/regression-analysis/index.html#section1)) <br/> <br/>

--

$$ y = \alpha + \beta * x + \epsilon $$

--

<img src="Week1_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

???
The mathematical equation that estimates a dependent variable Y from a set of predictor variables or regressors X.<br/>
Each regressor gets a weight/coefficient that is used to estimate Y. This coefficient is estimated according to some criterion, e.g. it minimises the sum of squared errors or maximises the logarithm of the likelihood function.

Additional reading: <br/>
[Estimation of the parameters](https://data.princeton.edu/wws509/notes/c2s2) <br/><br/>

`\(\epsilon\)` - measure of the accuracy of the model<br/>
`\(SS_{residual}=\sum_{i=1}^{n}(Y_{i}-\hat{Y}_{i})^2=\sum_{i=1}^{n}e_{i}^2\)`<br/> <br/>

Other options: https://en.wikipedia.org/wiki/Deming_regression

---

## Example

Predict the height (cm) of babies from their age (months), weight (grams) and sex:


```r
par(mfrow=c(1,3), bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16)
plot(Babies$Age, Babies$Height, xlab='Age (months)', ylab='Height (cm)')
plot(Babies$Weight, Babies$Height, xlab='Weight (grams)', ylab='Height (cm)')
boxplot(Babies$Height~Babies$Sex,xlab='Sex', ylab='Height (cm)')
```

<img src="Week1_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

## Interpretation: one predictor

$$ y = \alpha + \beta * Age + \epsilon $$


```r
lm1<-lm(Height~Age, data=Babies)
lm1$coefficients
```

```
## (Intercept)         Age 
##   57.025804    0.143174
```

Counterfactual interpretation (causal): an increase of 1 unit in the predictor value - Age (being one month older) - changes the outcome by `\(\beta\)` (the model-estimated value) <br/> <br/>

Predictive interpretation (descriptive): if we compare babies that differ in their age by 1 month, we expect older babies to be taller by `\(\beta\)` on average <br/> <br/>

???
The model summarises the difference in average Height between the children as a function of their Age. <br/> <br/>
The intercept is the average (predicted) score for children at birth (0 months)<br/> <br/>
The type of interpretation usually depends on the strength of the theoretical framework and the methodological design. If you have well-developed theoretical assumptions and an experimental design, this perhaps allows you to use the counterfactual interpretation.
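<br/><br/>
As a sanity check on these interpretations, a minimal sketch (assuming the simulated `Babies` data used throughout this lecture is loaded) that reproduces the fitted values and `\(SS_{residual}\)` from `lm1` by hand:


```r
# Model-implied (fitted) Height: alpha + beta * Age
alpha <- coef(lm1)[1] # intercept: predicted Height at Age 0
beta  <- coef(lm1)[2] # slope: average Height difference per month of Age
yhat  <- alpha + beta * Babies$Age

# Residual sum of squares: SS_residual = sum((Y - Yhat)^2)
sum((Babies$Height - yhat)^2)
sum(residuals(lm1)^2) # the same quantity, taken from the model object
```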
---

## Interpretation: multiple predictors

`$$y = \alpha + \beta_1 * Age + \beta_2 * Weight + \epsilon$$`

Interpretation becomes contingent on the other variables in the model <br/> <br/>


```r
lm2<-lm(Height~Age+Weight, data=Babies)
lm2$coefficients
```

```
## (Intercept)         Age      Weight 
##  40.1134418   0.1327346   0.0042808
```

Age: comparing babies that have the same Weight, but differ in their Age by one month, the model predicts a difference of `\(\beta_1\)` in their Height on average<br/> <br/>

Weight: comparing babies that have the same Age, but differ in their Weight by one gram, the model predicts a difference of `\(\beta_2\)` in their Height on average

The regression coefficient is a [partial correlation](https://en.wikipedia.org/wiki/Partial_correlation) estimate

???

```r
AgeRes=residuals(lm(Age~Weight, data=Babies))
HeightRes=residuals(lm(Height~Weight, data=Babies))
lm(HeightRes~AgeRes)$coefficients
```

```
##  (Intercept)       AgeRes 
## 1.808264e-17 1.327346e-01
```

---

## Interpretation: multiple predictors

`$$y = \alpha + \beta_1 * Age + \beta_2 * Sex + \epsilon$$`

Interpretation becomes contingent on the other variables in the model <br/> <br/>


```r
lm2<-lm(Height~Age+Sex, data=Babies)
lm2$coefficients
```

```
## (Intercept)         Age     SexBoys 
##  57.5038799   0.1403955  -0.8309449
```

Age: comparing babies that have the same Sex, but differ in their Age by one month, the model predicts a difference of `\(\beta_1\)` in their Height on average<br/> <br/>

Sex: comparing babies that have the same Age, but a different Sex, the model predicts a difference of `\(\beta_2\)` in their Height on average <br/> <br/>

???
The model summarises the difference in average Height between the children as a function of their Age and Sex. <br/> <br/>
The intercept is the average (predicted) score for girls (0 on the Sex variable) at 0 months (birth).

---

## Categorical predictors

`$$y = \alpha + \beta_1 * Age + \beta_2 * Sex_{Boys} + \epsilon$$`

What is the model doing:

.center[
<img src="CatPred.png" width="30%">
]

Each level is assigned a value: Girls - 0, Boys - 1 <br/> <br/>

The slope coefficient `\(\beta_2\)` models the difference between the two levels

???
Additional reading on types of dummy coding in R: [link](https://stats.idre.ucla.edu/r/modules/coding-for-categorical-variables-in-regression-models/)

---

## Interpretation: interactions

`$$y = \alpha + \beta_1 * Age + \beta_2 * Sex_{Boys} + \beta_3 * Age*Sex_{Boys} + \epsilon$$`

We allow the slope of Age to change linearly across the subgroups of the Sex variable<br/> <br/>


```r
lm3<-lm(Height~Age*Sex, data=Babies)
lm3$coefficients
```

```
## (Intercept)         Age     SexBoys Age:SexBoys 
##  56.1448603   0.2202400   1.8315868  -0.1607307
```

Age: in the case of girls, comparing babies that differ in Age by one month, the model predicts an average difference of `\(\beta_1\)`<br/> <br/>

Sex: the expected difference between girls at birth and boys at birth is the `\(\beta_2\)` coefficient<br/> <br/>

Age*Sex: the difference in the Age slope between Girls and Boys

???
The intercept is the predicted Girls' Height at birth <br/> <br/>
Predicted Height for Boys, when Age is 0: Intercept + `\(Sex_{Boys}\)` <br/> <br/>
Slope for the Boys: Age + Age:`\(Sex_{Boys}\)`
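<br/><br/>
These notes can be written directly as code; a short sketch (using the `lm3` object from this slide) that recovers the group-specific intercepts and slopes from the coefficients:


```r
b <- coef(lm3) # girls are the reference level (Sex = 0)
b["(Intercept)"]                # predicted Height for girls at birth
b["(Intercept)"] + b["SexBoys"] # predicted Height for boys at birth
b["Age"]                        # Age slope for girls
b["Age"] + b["Age:SexBoys"]     # Age slope for boys
```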
---

## Interpretation: interactions

What about bi-linear continuous interactions?

`$$y = \alpha + \beta_1 * Age + \beta_2 * Weight + \beta_3 * Age*Weight + \epsilon$$`


```r
lm4<-lm(Height~Age*Weight, data=Babies)
lm4$coefficients
```

```
##   (Intercept)           Age        Weight    Age:Weight 
## 30.7244904854  0.6913148965  0.0066360329 -0.0001397745
```

Age: keeping Weight at a value of 0 and comparing babies that differ by 1 month in their age, the model predicts an average difference of `\(\beta_1\)`

Weight: keeping Age at a value of 0 (birth) and comparing babies that differ by 1 gram in their weight, the model predicts an average difference of `\(\beta_2\)`

Age*Weight: the average difference between babies that differ by 1 month in their age changes by `\(\beta_3\)` with every 1 gram change in the babies' weight

???
By including an interaction we allow the model to be fit differently to subsets of data.<br/> <br/>
When to test interactions: do the theory/clinical practice/previous studies postulate the possibility of an interaction existing between your main effects? If yes, then proceed with caution. <br/> <br/>
+ a good sign is when the main effects are relatively strong and large and they explain a large amount of variation in the data

---

## What other information do we get from a linear model?


```r
lm1<-lm(Height~Age, data=Babies)
summary(lm1)
```

```
## 
## Call:
## lm(formula = Height ~ Age, data = Babies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4765  -4.1601  -0.3703   3.9198  12.3842 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 57.02580    1.18751  48.021   <2e-16 ***
## Age          0.14317    0.06426   2.228   0.0282 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.283 on 98 degrees of freedom
## Multiple R-squared:  0.04821, Adjusted R-squared:  0.0385 
## F-statistic: 4.964 on 1 and 98 DF,  p-value: 0.02817
```

???
Residuals - an estimate of the error of the model <br/> <br/>
Coefficients with standard errors, as well as t-tests and significance values. The test statistic (t-test) is used to test the significance of the predictor against 0. The reason why it is approximated with the t distribution: [Link](https://stats.stackexchange.com/a/344008)<br/> <br/>
Residual standard error - an estimate of the fit of our model: `\(\sqrt{\frac{SS_{residual}}{df}}\)`

---

## Determination coefficient

`\(R^2 = 1 - \frac{SS_{residual}}{SS_{total}}\)`


```r
ggplot()+geom_linerange(data=Babies,aes(x=Age, ymin=Height,ymax=lm1,colour=diff), size=1.2)+geom_line(data=Babies,aes(x=Age, y=lm1), size=1.2)+geom_point(data=Babies, aes(Age, y=Height), size=2)+ylab('Height')+xlab('Age')+ggtitle('SS_residual')+theme(axis.title=element_text(size=14), axis.text =element_text(size=12))
```

<img src="Week1_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

???
Residual sum of squares: calculating the vertical distances between the observed data points and the fitted regression line, raising them to the power of 2 (squaring) and summing them (addition). The residual sum of squares informs us about the unexplained variance in our data - the variance in the error term.

---

## Determination coefficient

`\(R^2 = 1 - \frac{SS_{residual}}{SS_{total}}\)`


```r
lm0<-lm(Height~1, data=Babies)
summary(lm0)
```

```
## 
## Call:
## lm(formula = Height ~ 1, data = Babies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.4165  -4.2284  -0.2062   3.6744  13.5940 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59.3953     0.5388   110.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.388 on 99 degrees of freedom
```

???
We can fit an intercept-only model - only the mean level of the dependent variable
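<br/><br/>
Putting the two models together, a minimal sketch computing the determination coefficient by hand from `lm1` and `lm0`:


```r
# R^2 = 1 - SS_residual / SS_total, with SS_total taken from
# the intercept-only model lm0
1 - sum(residuals(lm1)^2) / sum(residuals(lm0)^2) # ~0.048, as in summary(lm1)
```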
---

## Determination coefficient

`\(R^2 = 1 - \frac{SS_{residual}}{SS_{total}}\)`


```r
ggplot()+geom_linerange(data=Babies,aes(x=Age, ymin=Height,ymax=lm0,colour=diff2), size=1.2)+geom_line(data=Babies,aes(x=Age, y=lm0), size=1.2)+geom_point(data=Babies, aes(Age, y=Height), size=2)+ylab('Height')+xlab('Age')+ggtitle('SS_total')+theme(axis.title=element_text(size=14), axis.text =element_text(size=12))
```

<img src="Week1_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

???
Total sum of squares: calculating the vertical distances between the observed data points and the null model regression line (intercept only - the mean of the dependent variable), raising them to the power of 2 (squaring) and summing them (addition). The total sum of squares informs us about the total variance in our outcome/dependent variable.

---

## Improvement in our prediction?

`\(F = \frac{SS_m/df_m}{SS_r/df_r}\)`


```r
ggplot()+geom_linerange(data=Babies,aes(x=Age, ymin=lm1,ymax=lm0,colour=diff3), size=1.2)+geom_line(data=Babies,aes(x=Age, y=lm0), size=1.2)+geom_line(data=Babies, aes(Age, y=lm1), size=1.3, linetype='dotdash')+geom_point(data=Babies, aes(x=Age, y=Height), size=0.9)+ylab('Height')+xlab('Age')+ggtitle('Improvement')+theme(axis.title=element_text(size=14), axis.text =element_text(size=12))
```

<img src="Week1_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

???
We can also calculate the F-test by hand. First we would need to calculate the amount of variance in the data that was explained by our model. We would get that by calculating the distances between our fitted data points (fitted line) and the null model line (mean level of the dependent variable). We would square those distances and sum them up to get the model/explained sum of squares.<br/><br/><br/>
To get the F-test, we would divide the model sum of squares by the degrees of freedom spent on estimating the model (df model - the number of predictors). We would divide that by the unexplained amount of variance (the residual sum of squares divided by the residual degrees of freedom -> n - k - 1, where k is the number of predictors).

---

## Why is this important?

The general linear model is the "basis" for all other models covered by this course <br/> <br/>

In particular, more complex statistical models are often just generalisations of the general linear model <br/> <br/>

Other statistical models, such as correlations, the t-test, ANOVA, and ANCOVA, can be expressed and estimated within the general linear model <br/> <br/>


```r
cor(Babies$Height,Babies$Weight)
```

```
## [1] 0.3879701
```

```r
Babies$HeightStand=scale(Babies$Height, scale=TRUE, center=TRUE)
Babies$WeightStand=scale(Babies$Weight, scale=TRUE, center=TRUE)
lmCorr<-lm(HeightStand~-1+WeightStand, data=Babies)
lmCorr$coefficients
```

```
## WeightStand 
##   0.3879701
```

???
The general linear model is a special case of a broad family of models (Generalized Linear Models - more next week)

---

## Assumptions

a) Error distribution: `\(\mathcal{N}^{iid}(0,\sigma^2)\)` <br/> <br/>
b) Linearity and additivity <br/> <br/>
c) Validity <br/> <br/>
d) Lack of perfect multicollinearity <br/> <br/>

???
0 being the mean of the residuals is a by-product of OLS when we include an intercept. <br/>
The assumption of IID ~ independently and identically distributed residuals is often problematic in psychology, where we have clustered answers. <br/><br/>
Linearity - the outcome can be modelled as a linear function of separate predictors<br/>
Additivity - the postulated linear function should not vary across values of other variables; if it does, we need to add interactions<br/>
Validity - the data that you are measuring should reflect the phenomenon you are interested in<br/>
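<br/><br/>
For assumption d), a quick numerical check is the variance inflation factor; a minimal sketch, assuming the `car` package (used later in this lecture) is installed:


```r
# VIF for the two-predictor model from earlier; values well above ~5-10
# are commonly read as a warning sign of multicollinearity
car::vif(lm(Height ~ Age + Weight, data = Babies))
```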
---

class: inverse, middle, center

# Practical aspect

---

# Practical work

Example dataset: Income inequality and rates of hate crimes - [Article](https://fivethirtyeight.com/features/higher-rates-of-hate-crimes-are-tied-to-income-inequality/) and [Data](https://github.com/fivethirtyeight/data/tree/master/hate-crimes)

Reading local files into R:


```r
inequality<-read.table('inequality.txt',sep=',', header=T)
knitr::kable(head(inequality[,c(1,2,8,12)]), format = 'html')
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> state </th>
   <th style="text-align:right;"> median_household_income </th>
   <th style="text-align:right;"> gini_index </th>
   <th style="text-align:right;"> avg_hatecrimes_per_100k_fbi </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Alabama </td>
   <td style="text-align:right;"> 42278 </td>
   <td style="text-align:right;"> 0.472 </td>
   <td style="text-align:right;"> 1.8064105 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Alaska </td>
   <td style="text-align:right;"> 67629 </td>
   <td style="text-align:right;"> 0.422 </td>
   <td style="text-align:right;"> 1.6567001 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Arizona </td>
   <td style="text-align:right;"> 49254 </td>
   <td style="text-align:right;"> 0.455 </td>
   <td style="text-align:right;"> 3.4139280 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Arkansas </td>
   <td style="text-align:right;"> 44922 </td>
   <td style="text-align:right;"> 0.458 </td>
   <td style="text-align:right;"> 0.8692089 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> California </td>
   <td style="text-align:right;"> 60487 </td>
   <td style="text-align:right;"> 0.471 </td>
   <td style="text-align:right;"> 2.3979859 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Colorado </td>
   <td style="text-align:right;"> 60940 </td>
   <td style="text-align:right;"> 0.457 </td>
   <td style="text-align:right;"> 2.8046888 </td>
  </tr>
</tbody>
</table>

???
In order to read/load the data into R, you need to specify where R is supposed to look for the files. One way you can achieve this is by setting the working directory: click on the __Session__ tab, then __Set working directory__, then __Choose directory__, and navigate to the folder where your data is. Then you can just follow the code outlined in the presentation and/or in the R code for this lecture.

---

## Dataset: outcome measures

Probability density plots: the probability of the random variable falling within a range of values <br/>


```r
par(mfrow=c(1,2), bty='n',mar = c(5, 4, 1, .1), cex=1.1, pch=16)
plot(density(inequality$hate_crimes_per_100k_splc, na.rm = T), main='Crimes per 100k')
plot(density(inequality$avg_hatecrimes_per_100k_fbi, na.rm = T), main='Average Crimes')
```

<img src="Week1_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

???
Shape of the distribution, the most likely range of values, the spread of values, modality, etc.
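<br/><br/>
Alongside the density plots, simple descriptive statistics are worth inspecting; a minimal sketch for the two outcome measures:


```r
# Five-number summaries (plus mean) for both outcome variables;
# missing values are reported rather than silently dropped
summary(inequality$hate_crimes_per_100k_splc)
summary(inequality$avg_hatecrimes_per_100k_fbi)
```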
---

## Dataset: some of the predictors


```r
par(mfrow=c(1,2), bty='n',mar = c(5, 4, 1, .1), cex=1.1, pch=16)
plot(density(inequality$median_household_income, na.rm = T), main='Income')
plot(density(inequality$gini_index, na.rm = T), main='Gini')
```

<img src="Week1_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

???
We do not have distributional expectations in the case of predictors; however, it is always good practice to look at density functions, histograms, scatter plots, and descriptive statistics, as they might reveal mistakes in the data, non-meaningful data points, or just outliers and leverage values.

---

## Bivariate plots

Scatter plots:


```r
par(bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16)
plot(inequality$median_household_income, inequality$avg_hatecrimes_per_100k_fbi, xlab='Median household income',ylab='Avg hatecrimes')
```

<img src="Week1_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />

---

## Bivariate plots

Scatter plots:


```r
par(bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16)
plot(inequality$gini_index, inequality$avg_hatecrimes_per_100k_fbi,xlab='Gini index',ylab='Avg hatecrimes')
```

<img src="Week1_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />

---

## Correlations in the data


```r
names(inequality)[c(2,8,12)]=c('Median_income','Gini','Avg_hate')
cor(inequality[,c(2,8,12)], use="complete.obs")
```

```
##               Median_income       Gini  Avg_hate
## Median_income     1.0000000 -0.1497444 0.3182464
## Gini             -0.1497444  1.0000000 0.4212719
## Avg_hate          0.3182464  0.4212719 1.0000000
```

???
In this situation, I am primarily using correlation to summarise the collected data. High correlations might prepare us for collinearity issues in the regressions, unexpected changes in the coefficients, etc.

---

## Linear regression in R


```r
mod1<-lm(Avg_hate~Median_income, data=inequality)
summary(mod1)
```

```
## 
## Call:
## lm(formula = Avg_hate ~ Median_income, data = inequality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3300 -1.1183 -0.1656  0.7827  7.7762 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -9.564e-01  1.448e+00  -0.661   0.5121  
## Median_income  6.054e-05  2.603e-05   2.326   0.0243 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.642 on 48 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1013, Adjusted R-squared:  0.08256 
## F-statistic: 5.409 on 1 and 48 DF,  p-value: 0.0243
```

???
In this model, we analyse the average hate crime rate reported by the FBI as a function of the median income of the population. We are asking: does knowing the median income of the population help us estimate the average hate crime rate better than just knowing its mean level?

---

## Adding more predictors

Model comparison works beyond the null (intercept-only) model - we can also compare models with one or more predictors <br/> <br/>


```r
mod2<-lm(Avg_hate~Median_income+Gini, data=inequality)
anova(mod2)
```

```
## Analysis of Variance Table
## 
## Response: Avg_hate
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## Median_income  1 14.584  14.584  7.0649 0.0107093 *  
## Gini           1 32.389  32.389 15.6906 0.0002517 ***
## Residuals     47 97.020   2.064                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

???
We can add a second predictor - the Gini index - and calculate the anova table using the __anova__ function. We are comparing the model that we have built with the null model (intercept-only model).
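<br/><br/>
The F value for Gini in this table can be reproduced by hand from the sums of squares it reports; a minimal sketch:


```r
# F = (SS_Gini / df_Gini) / (SS_residual / df_residual)
(32.389 / 1) / (97.020 / 47) # ~15.69, matching the anova() output above
```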
---

## Adding more predictors

We can use a stepwise approach and build our models by sequentially adding the predictors

This allows us to see the relative improvement that we get by adding the Gini index to the model with median household income


```r
anova(mod1,mod2)
```

```
## Analysis of Variance Table
## 
## Model 1: Avg_hate ~ Median_income
## Model 2: Avg_hate ~ Median_income + Gini
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     48 129.41                                  
## 2     47  97.02  1    32.389 15.691 0.0002517 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

???
When given more than one model, the __anova__ function compares the models against each other, in the order specified.

---

## Main effects for two predictors


```r
summary(mod2)
```

```
## 
## Call:
## lm(formula = Avg_hate ~ Median_income + Gini, data = inequality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2251 -0.9284 -0.2918  0.9861  4.5301 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.959e+01  4.871e+00  -4.021 0.000208 ***
## Median_income  7.421e-05  2.304e-05   3.221 0.002319 ** 
## Gini           3.936e+01  9.938e+00   3.961 0.000252 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.437 on 47 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3262, Adjusted R-squared:  0.2975 
## F-statistic: 11.38 on 2 and 47 DF,  p-value: 9.337e-05
```

---

## Interactions?

Should median household income be adjusted across the levels of the Gini index (or vice versa)?


```r
mod3<-lm(Avg_hate~Median_income*Gini, data=inequality)
anova(mod1,mod2, mod3)
```

```
## Analysis of Variance Table
## 
## Model 1: Avg_hate ~ Median_income
## Model 2: Avg_hate ~ Median_income + Gini
## Model 3: Avg_hate ~ Median_income * Gini
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     48 129.41                                  
## 2     47  97.02  1    32.389 19.742 5.539e-05 ***
## 3     46  75.47  1    21.550 13.135 0.0007218 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

???
An interaction between variables is an easy thing to add. You can use multiplication (*) - Median_income * Gini will result in two main effects (Median_income and the Gini index) and one interaction (between the two main effects). <br/><br/>
You can include only the interaction by using this specification of the model regressors -> Median_income:Gini. This will result in an interaction-only model, without the main effects. However, this is not something that you would do often.

---

## Interactions!

```
## 
## Call:
## lm(formula = Avg_hate ~ Median_income * Gini, data = inequality)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.16383 -0.96279 -0.01053  0.88008  2.75763 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         9.056e+01  3.070e+01   2.950 0.004986 ** 
## Median_income      -1.724e-03  4.965e-04  -3.472 0.001136 ** 
## Gini               -2.001e+02  6.666e+01  -3.002 0.004330 ** 
## Median_income:Gini  3.907e-03  1.078e-03   3.624 0.000722 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.281 on 46 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.4759, Adjusted R-squared:  0.4417 
## F-statistic: 13.92 on 3 and 46 DF,  p-value: 1.367e-06
```
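???
With an interaction in the model, the slope of one predictor depends on the value of the other. A minimal sketch of the model-implied Median_income slope (the Gini value 0.45 is an arbitrary illustrative choice within the observed range):


```r
b <- coef(mod3)
# Implied slope of Median_income at a given Gini value:
# beta_Median_income + beta_interaction * Gini
b["Median_income"] + b["Median_income:Gini"] * 0.45
```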
---

## Visualisations of the predictions


```r
require(interactions)
interact_plot(mod3, pred=Median_income, modx=Gini, plot.points = T)
```

<img src="Week1_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" />

---

## Visualisations of the predictions


```r
p<-ggplot(simulatedD, aes(Median_income, Avg_hate_pred, color=Gini,frame=Gini))+geom_point()
plotly::ggplotly(p)
```
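???
The `simulatedD` data frame is not constructed on the slide; a minimal sketch of one way to build it (the grid values and the name `Avg_hate_pred` are illustrative assumptions):


```r
# Hypothetical grid of predictor values roughly covering the observed ranges
simulatedD <- expand.grid(
  Median_income = seq(35000, 75000, by = 1000),
  Gini = seq(0.42, 0.48, by = 0.01))
# Model-implied average hate crime rate at each grid point
simulatedD$Avg_hate_pred <- predict(mod3, newdata = simulatedD)
```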
---

## Model diagnostics


```r
summary(mod3$residuals)
```

```
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.16383 -0.96279 -0.01053  0.00000  0.88008  2.75763
```

Studentized residuals `\(\epsilon_{Ti}=\frac{\epsilon_i}{\sigma_{(-i)}\sqrt{1-h_i}}\)`

???
Studentized residual: the division of a residual by an estimate of its standard deviation, weighted by an estimate of leverage (hat values) <br/> <br/>
Delete the observations one at a time and fit the model on the remaining n-1 observations

---

class: center, inverse
background-image: url("overfit.jpg")

---

## Normality


```r
car::qqPlot(mod3)
```

<img src="Week1_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" />

```
## [1]  9 18
```

[Cross-validated on residuals](https://stats.stackexchange.com/a/101290) and [Cross-validated shiny](https://xiongge.shinyapps.io/QQplots/)

???
We would expect our residuals to follow the diagonal line and not deviate from it considerably - that is, not fall outside of the confidence intervals

---

## Homoscedasticity


```r
par(bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16)
car::residualPlot(mod3, type='rstudent')
```

<img src="Week1_files/figure-html/unnamed-chunk-39-1.png" style="display: block; margin: auto;" />

???
We would expect an equal spread of residuals across the fitted values - they should follow the horizontal line and be equidistant from it.

---

## Outliers and leverage points


```r
par(bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16)
car::influenceIndexPlot(mod3, vars=c("Cook","studentized"))
```

<img src="Week1_files/figure-html/unnamed-chunk-40-1.png" style="display: block; margin: auto;" />

???
The Cook's Distance of observation __i__ is defined as the sum of all the changes in the regression model when observation __i__ is removed from it -> it indicates data points that change the behaviour of the model when removed (influential data points). <br/><br/>
The studentized residuals estimate how the residuals vary when the corresponding observation is deleted, and they provide us with information about outliers.

---

## Outliers and leverage points


```r
par(bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16)
car::influenceIndexPlot(mod3, vars=c("Bonf","hat"))
```

<img src="Week1_files/figure-html/unnamed-chunk-41-1.png" style="display: block; margin: auto;" />

---

## Autocorrelation between the residuals


```r
par(bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16)
stats::acf(resid(mod3))
```

<img src="Week1_files/figure-html/unnamed-chunk-42-1.png" style="display: block; margin: auto;" />

???
We also want to check the autocorrelation between our residuals. This plot shows the lagged autocorrelation, where position 0 is the correlation of a residual with itself, followed by correlations with subsequent residuals.

---

## Observed versus predicted values


```r
par(bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16)
plot(predict(mod3),mod3$model$Avg_hate, xlab='Predicted values (model 3)', ylab='Observed values (avg hatecrimes)')
```

<img src="Week1_files/figure-html/unnamed-chunk-43-1.png" style="display: block; margin: auto;" />

???
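One numerical summary of this plot: the squared correlation between predicted and observed values, which for an OLS model with an intercept reproduces the model's `\(R^2\)`; a minimal sketch:


```r
# Squared correlation between fitted and observed outcome values;
# ~0.476, matching Multiple R-squared for mod3
cor(predict(mod3), mod3$model$Avg_hate)^2
```
<br/> <br/>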
[Further reading](https://www.sagepub.com/sites/default/files/upm-binaries/38503_Chapter6.pdf)<br/> <br/>
For a possibly better package for checking the model assumptions, try: <br/>
require(performance)<br/>
check_model(mod3)<br/>

---

## Model evaluation

Rerun the model excluding the leverage values

.tiny[

```r
mod3.cor<-lm(Avg_hate~Median_income*Gini, data=inequality, subset=-9)
summary(mod3.cor)
```

```
## 
## Call:
## lm(formula = Avg_hate ~ Median_income * Gini, data = inequality, 
##     subset = -9)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.90964 -0.94966 -0.08526  0.73470  2.55257 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)         4.429e+01  3.080e+01   1.438    0.157
## Median_income      -7.992e-04  5.228e-04  -1.529    0.133
## Gini               -9.668e+01  6.725e+01  -1.438    0.157
## Median_income:Gini  1.837e-03  1.145e-03   1.604    0.116
## 
## Residual standard error: 1.154 on 45 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1285, Adjusted R-squared:  0.07041 
## F-statistic: 2.212 on 3 and 45 DF,  p-value: 0.09974
```
]

---

## Model evaluation

Model without the interaction

.tiny[

```r
mod2.cor<-lm(Avg_hate~Median_income+Gini, data=inequality, subset=-9)
summary(mod2.cor)
```

```
## 
## Call:
## lm(formula = Avg_hate ~ Median_income + Gini, data = inequality, 
##     subset = -9)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.86717 -0.97806 -0.05752  1.05394  2.50740 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   -4.491e+00  5.015e+00  -0.895   0.3752  
## Median_income  3.906e-05  2.012e-05   1.942   0.0583 .
## Gini           1.005e+01  1.005e+01   1.000   0.3226  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.174 on 46 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.07866, Adjusted R-squared:  0.0386 
## F-statistic: 1.964 on 2 and 46 DF,  p-value: 0.152
```
]

---

## Take away message (important bits)

Linear model - continuous and categorical predictors <br/> <br/>
Interpreting the estimated coefficients <br/> <br/>
Interpreting the results (significance, determination coefficient, F-test, residuals) <br/> <br/>
Understanding the assumptions of the linear model <br/> <br/>
Using linear models in R: overall and stepwise approaches <br/> <br/>
Model diagnostics based on the pre- and post-modelling procedures covered by the lecture and beyond

---

## Literature

First and second chapters of "Data Analysis Using Regression and Multilevel/Hierarchical Models" by Andrew Gelman and Jennifer Hill <br/> <br/>

First three chapters of "Regression Analysis and Linear Models: Concepts, Applications, and Implementation" by Richard B. Darlington and Andrew F. Hayes <br/> <br/>

van Rij, J., Vaci, N., Wurm, L. H., & Feldman, L. B. (2020). Alternative quantitative methods in psycholinguistics: Implications for theory and design. Word Knowledge and Word Usage, 83-126.

---

## Exercise for the next week

1. Fill in the reflection and feedback form by Monday: https://forms.gle/DoeBNKZqCeRvSmZ46 <br/><br/>
2. Go over the practical aspect of the lecture: read the data into R, plot uni- and bivariate plots, calculate descriptive statistics, make a regression model, and evaluate its fit <br/><br/>
3. Think about how you could make a model where you have high or perfect multicollinearity between two predictors <br/><br/>
4. How would you create a model where you have `\(R^2=1\)`?

---

# Thank you for your attention

[Song for the weekend](https://youtu.be/pZTLFu79UbY)