class: center background-image: url("main2.png") ## Lecture 1: Outline of the planned work + linear regression <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> <br/> ### Dr Nemanja Vaci 12/02/2021 --- <style type="text/css"> body, td { font-size: 15px; } code.r{ font-size: 15px; } pre { font-size: 20px } .huge .remark-code { /*Change made here*/ font-size: 200% !important; } .tiny .remark-code { /*Change made here*/ font-size: 80% !important; } </style> ## Press record --- ## Background .pull-left[ BA in <b> Psychology </b><br/> MSc in <b> Applied Statistics</b><br/> (Novi Sad, Serbia) <br/> <br/> PhD in <b> Psychology </b> <br/> (Klagenfurt, Austria) <br/> <br/> Postdoc in <b>Data Science/Bioinformatics</b><br/> (Oxford, UK) <br/> <br/> Departments: Psychology, Applied Statistics, Psychiatry ] .pull-right[ <img src="Places.png", width = "120%"> ] --- ## Logistics - Lectures will be hosted at: https://github.com/nvaci/Advanced-Statistics - 7 lectures (5 different statistical procedures + additional content) <br/> <br/> a) Generalized linear models <br/> b) Structural equation modelling: path models and confirmatory factor analysis <br/> c) Mixed-effects models: cross-sectional and longitudinal data <br/> <br/> - Theory and practice <br/> <br/> - Data, statistical analysis and commented code <br/> <br/> - R as a statistical environment <br/> <br/> - Practical exam (more information in a couple of weeks) <br/> <br/> <br/> <br/> Code for this lecture: [link](https://nvaci.github.io/Lecture_1_code/Lecture_1_Code.html) ??? General overview of the course<br/> <br/> Focus on theory and applications, but also on how to use these models in the R environment<br/> <br/> Each lecture is planned to be supported by an HTML presentation with the main slide content and lecturer's notes + commented code that produced the outputs in the presentation.
Often, I will include additional content that you can explore, but it is not going to be used for any subsequent knowledge assessments. --- class: center, inverse background-image: url("main.gif") --- ## Linear regression A method that summarises how the average value of a numerical outcome variable varies over values defined by linear functions of predictors ([visualisation](https://seeing-theory.brown.edu/regression-analysis/index.html#section1)) <br/> <br/> -- $$ y = \alpha + \beta * x + \epsilon $$ -- <img src="Index_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" /> ??? The mathematical equation that estimates a dependent variable Y from a set of predictor variables or regressors X.<br/> Each regressor gets a weight/coefficient that is used to estimate Y. This coefficient is estimated according to some criterion, e.g. one that minimises the sum of squared errors or maximises the logarithm of the likelihood function. Additional reading: <br/> [Estimation of the parameters](https://data.princeton.edu/wws509/notes/c2s2) <br/><br/> `\(\epsilon\)` - a measure of the accuracy of the model<br/> `\(SS_{residual}=\sum_{i=1}^{n}(Y_{i}-\hat{Y}_{i})^2=\sum_{i=1}^{n}e_{i}^2\)` --- ## Example Predict the height (cm) of babies from their age (months), weight (grams) and gender: ```r par(mfrow=c(1,3), bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16) plot(Babies$Age, Babies$Height, xlab='Age (months)', ylab='Height (cm)') plot(Babies$Weight, Babies$Height, xlab='Weight (grams)', ylab='Height (cm)') boxplot(Babies$Height~Babies$Gender,xlab='Gender', ylab='Height (cm)') ``` <img src="Index_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> --- ## Interpretation: one predictor $$ y = \alpha + \beta * Age + \epsilon $$ ```r lm1<-lm(Height~Age, data=Babies) lm1$coefficients ``` ``` ## (Intercept) Age ## 57.025804 0.143174 ``` Counterfactual interpretation (causal): An increase of 1 unit in the predictor value - Age (being one month older) changes the
outcome by the `\(\beta\)` (model estimated value) <br/> <br/> Predictive interpretation (descriptive): If we compare babies that differ in their age by 1 month, we expect to see that older babies are taller by `\(\beta\)` on average <br/> <br/> ??? The model summarises the difference in average Height between the children as a function of their Age. <br/> <br/> The intercept is the average (predicted) score for children at birth (0 months)<br/> <br/> The type of interpretation usually depends on the strength of the theoretical framework and methodological design. If you have well-developed theoretical assumptions and an experimental design, this perhaps allows you to use the counterfactual interpretation. --- ## Interpretation: multiple predictors `$$y = \alpha + \beta_1 * Age + \beta_2 * Gender + \epsilon$$` The interpretation becomes contingent on the other variables in the model <br/> <br/> ```r lm2<-lm(Height~Age+Gender, data=Babies) lm2$coefficients ``` ``` ## (Intercept) Age GenderBoys ## 57.5038799 0.1403955 -0.8309449 ``` Age: Comparing babies that have identical Gender, but differ in their Age by one month, the model predicts a difference of `\(\beta_1\)` in their Height on average<br/> <br/> Gender: Comparing babies that have the same Age, but different Gender, the model predicts a difference of `\(\beta_2\)` in their Height on average <br/> <br/> ??? The model summarises the difference in average Height between the children as a function of their Age and Gender. <br/> <br/> The intercept is the average (predicted) score for girls (0 on the Gender variable) at 0 months of Age (birth). --- ## Categorical predictors `$$y = \alpha + \beta_1 * Age + \beta_2 * Gender_{Boys} + \epsilon$$` What is the model doing: .center[ <img src="CatPred.png", width = "30%"> ] Each level is assigned a value: Girls - 0, Boys - 1 <br/> <br/> The slope coefficient `\(\beta_2\)` models the difference between the two levels ???
Additional reading on types of dummy coding in R: [link](https://stats.idre.ucla.edu/r/modules/coding-for-categorical-variables-in-regression-models/) --- ## Interpretation: interactions `$$y = \alpha + \beta_1 * Age + \beta_2 * Gender_{Boys} + \beta_3 * Age*Gender_{Boys} + \epsilon$$` We allow the slope of age to change linearly across the subgroups of the Gender variable<br/> <br/> ```r lm3<-lm(Height~Age*Gender, data=Babies) lm3$coefficients ``` ``` ## (Intercept) Age GenderBoys Age:GenderBoys ## 56.1448603 0.2202400 1.8315868 -0.1607307 ``` Age: In the case of girls, comparing babies that differ in age by one month, the model predicts an average difference of `\(\beta_1\)`<br/> <br/> Gender: The expected difference between girls at birth and boys at birth is the `\(\beta_2\)` coefficient<br/> <br/> Age*Gender: The difference in the Age slope between Girls and Boys ??? The intercept is the predicted Height for Girls at birth <br/> <br/> Predicted Height for Boys, when Age is 0: Intercept + `\(Gender_{boys}\)`<br/> <br/> Slope for the Boys: Age + `\(Age:Gender_{Boys}\)` --- ## Interpretation: interactions What about interactions between two continuous predictors? `$$y = \alpha + \beta_1 * Age + \beta_2 * Weight + \beta_3 * Age*Weight + \epsilon$$` ```r lm4<-lm(Height~Age*Weight, data=Babies) lm4$coefficients ``` ``` ## (Intercept) Age Weight Age:Weight ## 30.7244904854 0.6913148965 0.0066360329 -0.0001397745 ``` Age: Keeping Weight at the value of 0 and comparing babies that differ by 1 month in their age, the model predicts an average difference of `\(\beta_1\)` Weight: Keeping Age at the value of 0 (birth) and comparing babies that differ by 1 gram in their weight, the model predicts an average difference of `\(\beta_2\)` Age*Weight: The average difference between babies that differ by 1 month in their age changes by `\(\beta_3\)` with every 1 gram change in babies' weight ???
By including an interaction we allow a model to be fit differently to subsets of data.<br/> <br/> When to test the interactions: does the theory/clinical practice/previous studies postulate possibilities of interaction existing between your main effects? If yes, then proceed with caution. <br/> <br/> + a good sign is when the main effects are relatively strong and large and they explain a large amount of variation in the data --- ## What other information do we get from linear model? ```r lm1<-lm(Height~Age, data=Babies) summary(lm1) ``` ``` ## ## Call: ## lm(formula = Height ~ Age, data = Babies) ## ## Residuals: ## Min 1Q Median 3Q Max ## -14.4765 -4.1601 -0.3703 3.9198 12.3842 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 57.02580 1.18751 48.021 <2e-16 *** ## Age 0.14317 0.06426 2.228 0.0282 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 5.283 on 98 degrees of freedom ## Multiple R-squared: 0.04821, Adjusted R-squared: 0.0385 ## F-statistic: 4.964 on 1 and 98 DF, p-value: 0.02817 ``` ??? Residuals - we talk about that later <br/> <br/> Coefficients with standard errors, as well as t-test and significance values. Test statistic (t-test) is used to test the significance of the predictor against 0. 
The reason why it is approximated with a t-distribution: [Link](https://stats.stackexchange.com/a/344008)<br/> <br/> Residual standard error - an estimate of the fit of our model: `\(\sqrt{\frac{SS_{residual}}{df}}\)` --- ## Determination coefficient `\(R^2 = 1 - \frac{SS_{residual}}{SS_{total}}\)` ``` ## Loading required package: ggplot2 ``` ```r ggplot()+geom_linerange(data=Babies,aes(x=Age, ymin=Height,ymax=lm1,colour=diff), size=1.2)+geom_line(data=Babies,aes(x=Age, y=lm1), size=1.2)+geom_point(data=Babies, aes(Age, y=Height), size=2)+ylab('Height')+xlab('Age')+ggtitle('SS_residual')+theme(axis.title=element_text(size=14), axis.text =element_text(size=12)) ``` <img src="Index_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> --- ## Determination coefficient `\(R^2 = 1 - \frac{SS_{residual}}{SS_{total}}\)` ```r lm0<-lm(Height~1, data=Babies) summary(lm0) ``` ``` ## ## Call: ## lm(formula = Height ~ 1, data = Babies) ## ## Residuals: ## Min 1Q Median 3Q Max ## -16.4165 -4.2284 -0.2062 3.6744 13.5940 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 59.3953 0.5388 110.2 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 5.388 on 99 degrees of freedom ``` --- ## Determination coefficient `\(R^2 = 1 - \frac{SS_{residual}}{SS_{total}}\)` ```r ggplot()+geom_linerange(data=Babies,aes(x=Age, ymin=Height,ymax=lm0,colour=diff2), size=1.2)+geom_line(data=Babies,aes(x=Age, y=lm0), size=1.2)+geom_point(data=Babies, aes(Age, y=Height), size=2)+ylab('Height')+xlab('Age')+ggtitle('SS_total')+theme(axis.title=element_text(size=14), axis.text =element_text(size=12)) ``` <img src="Index_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> --- ## Improvement in our prediction?
`\(F = \frac{SS_m/df_m}{SS_r/df_r}\)` ```r ggplot()+geom_linerange(data=Babies,aes(x=Age, ymin=lm1,ymax=lm0,colour=diff3), size=1.2)+geom_line(data=Babies,aes(x=Age, y=lm0), size=1.2)+geom_line(data=Babies, aes(Age, y=lm1), size=1.3, linetype='dotdash')+geom_point(data=Babies, aes(x=Age, y=Height), size=0.9)+ylab('Height')+xlab('Age')+ggtitle('Improvement')+theme(axis.title=element_text(size=14), axis.text =element_text(size=12)) ``` <img src="Index_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> --- ## Why is this important? The general linear model is the "basis" for all other models covered in this course <br/> <br/> In particular, more complex statistical models are often just generalisations of the general linear model <br/> <br/> The same would apply to "simpler" procedures, such as correlations, t-tests, ANOVA, ANCOVA, etc. <br/> <br/> ```r cor(Babies$Height,Babies$Weight) ``` ``` ## [1] 0.3879701 ``` ```r Babies$HeightStand=scale(Babies$Height, scale=TRUE, center=TRUE) Babies$WeightStand=scale(Babies$Weight, scale=TRUE, center=TRUE) lmCorr<-lm(HeightStand~-1+WeightStand, data=Babies) lmCorr$coefficients ``` ``` ## WeightStand ## 0.3879701 ``` ??? The general linear model is a special case of a broader family of models (Generalized Linear Models - more next week) --- ## Assumptions a) Errors: `\(\mathcal{N}^{iid}(0,\sigma^2)\)` <br/> <br/> b) Linearity and additivity <br/> <br/> c) Validity <br/> <br/> d) Lack of perfect multicollinearity <br/> <br/> ??? A mean of 0 for the residuals is a by-product of OLS when we include an intercept. <br/> The assumption of IID ~ independently and identically distributed residuals is often problematic in psychology, where we have clustered answers.
<br/><br/> Linearity - outcome can be modelled as a linear function of separate predictors<br/> Additivity - postulated linear function should not vary across values of other variables, if it does - we need to add interactions<br/> Validity - data that you are measuring should reflect the phenomenon you are interested in<br/> --- class: inverse, middle, center # Practical aspect --- #Practical work Example dataset: Income inequality and rates of hate crimes - [Article](https://fivethirtyeight.com/features/higher-rates-of-hate-crimes-are-tied-to-income-inequality/) and [Data](https://github.com/fivethirtyeight/data/tree/master/hate-crimes) Reading local files to R: ```r inequality<-read.table('inequality.txt',sep=',', header=T) knitr::kable(head(inequality[,c(1,2,8,12)]), format = 'html') ``` <table> <thead> <tr> <th style="text-align:left;"> state </th> <th style="text-align:right;"> median_household_income </th> <th style="text-align:right;"> gini_index </th> <th style="text-align:right;"> avg_hatecrimes_per_100k_fbi </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Alabama </td> <td style="text-align:right;"> 42278 </td> <td style="text-align:right;"> 0.472 </td> <td style="text-align:right;"> 1.8064105 </td> </tr> <tr> <td style="text-align:left;"> Alaska </td> <td style="text-align:right;"> 67629 </td> <td style="text-align:right;"> 0.422 </td> <td style="text-align:right;"> 1.6567001 </td> </tr> <tr> <td style="text-align:left;"> Arizona </td> <td style="text-align:right;"> 49254 </td> <td style="text-align:right;"> 0.455 </td> <td style="text-align:right;"> 3.4139280 </td> </tr> <tr> <td style="text-align:left;"> Arkansas </td> <td style="text-align:right;"> 44922 </td> <td style="text-align:right;"> 0.458 </td> <td style="text-align:right;"> 0.8692089 </td> </tr> <tr> <td style="text-align:left;"> California </td> <td style="text-align:right;"> 60487 </td> <td style="text-align:right;"> 0.471 </td> <td style="text-align:right;"> 2.3979859 </td> 
</tr> <tr> <td style="text-align:left;"> Colorado </td> <td style="text-align:right;"> 60940 </td> <td style="text-align:right;"> 0.457 </td> <td style="text-align:right;"> 2.8046888 </td> </tr> </tbody> </table> --- ##Dataset: outcome measures Probability density plots: probability of the random variable falling within a range of values <br/> ```r par(mfrow=c(1,2), bty='n',mar = c(5, 4, 1, .1), cex=1.1, pch=16) plot(density(inequality$hate_crimes_per_100k_splc, na.rm = T), main='Crimes per 100k') plot(density(inequality$avg_hatecrimes_per_100k_fbi, na.rm = T), main='Average Crimes') ``` <img src="Index_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> ??? Shape of the distribution, the most likely range of values, the spread of values, modality etc. --- ##Dataset: some of the predictors ```r par(mfrow=c(1,2), bty='n',mar = c(5, 4, 1, .1), cex=1.1, pch=16) plot(density(inequality$median_household_income, na.rm = T), main='Income') plot(density(inequality$gini_index, na.rm = T), main='Gini') ``` <img src="Index_files/figure-html/unnamed-chunk-20-1.png" style="display: block; margin: auto;" /> --- ##Bivariate plots Scatter plots: ```r par(bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16) plot(inequality$median_household_income, inequality$avg_hatecrimes_per_100k_fbi, xlab='Median household income',ylab='Avg hatecrimes') ``` <img src="Index_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" /> --- ##Bivariate plots Scatter plots: ```r par(bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16) plot(inequality$gini_index, inequality$avg_hatecrimes_per_100k_fbi,xlab='Gini index',ylab='Avg hatecrimes') ``` <img src="Index_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" /> --- ##Correlations in the data ```r cor(inequality[,c(2,8,12)], use="complete.obs") ``` ``` ## median_household_income gini_index ## median_household_income 1.0000000 -0.1497444 ## gini_index -0.1497444 1.0000000 ## 
avg_hatecrimes_per_100k_fbi 0.3182464 0.4212719 ## avg_hatecrimes_per_100k_fbi ## median_household_income 0.3182464 ## gini_index 0.4212719 ## avg_hatecrimes_per_100k_fbi 1.0000000 ``` ??? In this situation, I am primarily using correlation to summarise collected data. High correlations might prepare us for collinearity issues in regressions, unexpected changes in the coefficients etc. --- ## Linear regression in R ```r mod1<-lm(avg_hatecrimes_per_100k_fbi~median_household_income, data=inequality) summary(mod1) ``` ``` ## ## Call: ## lm(formula = avg_hatecrimes_per_100k_fbi ~ median_household_income, ## data = inequality) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.3300 -1.1183 -0.1656 0.7827 7.7762 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -9.564e-01 1.448e+00 -0.661 0.5121 ## median_household_income 6.054e-05 2.603e-05 2.326 0.0243 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.642 on 48 degrees of freedom ## (1 observation deleted due to missingness) ## Multiple R-squared: 0.1013, Adjusted R-squared: 0.08256 ## F-statistic: 5.409 on 1 and 48 DF, p-value: 0.0243 ``` --- ## Adding more predictors Model comparison works beyond null - intercept only model and models with one or more predictors <br/> <br/> ```r mod2<-lm(avg_hatecrimes_per_100k_fbi~median_household_income+gini_index, data=inequality) anova(mod2) ``` ``` ## Analysis of Variance Table ## ## Response: avg_hatecrimes_per_100k_fbi ## Df Sum Sq Mean Sq F value Pr(>F) ## median_household_income 1 14.584 14.584 7.0649 0.0107093 * ## gini_index 1 32.389 32.389 15.6906 0.0002517 *** ## Residuals 47 97.020 2.064 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 ``` --- ## Adding more predictors We could use stepwise approach and build our models by sequentially adding the predictors This allows us to see relative improvement that we get by adding gini index to the model with median household income ```r anova(mod1,mod2) ``` ``` ## Analysis of Variance Table ## ## Model 1: avg_hatecrimes_per_100k_fbi ~ median_household_income ## Model 2: avg_hatecrimes_per_100k_fbi ~ median_household_income + gini_index ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 48 129.41 ## 2 47 97.02 1 32.389 15.691 0.0002517 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- ## Main effects for two predictors ```r summary(mod2) ``` ``` ## ## Call: ## lm(formula = avg_hatecrimes_per_100k_fbi ~ median_household_income + ## gini_index, data = inequality) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.2251 -0.9284 -0.2918 0.9861 4.5301 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -1.959e+01 4.871e+00 -4.021 0.000208 *** ## median_household_income 7.421e-05 2.304e-05 3.221 0.002319 ** ## gini_index 3.936e+01 9.938e+00 3.961 0.000252 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.437 on 47 degrees of freedom ## (1 observation deleted due to missingness) ## Multiple R-squared: 0.3262, Adjusted R-squared: 0.2975 ## F-statistic: 11.38 on 2 and 47 DF, p-value: 9.337e-05 ``` --- ## Interactions? Should median household income be adjusted across the levels of gini index (or vice versa)? 
```r mod3<-lm(avg_hatecrimes_per_100k_fbi~median_household_income*gini_index, data=inequality) anova(mod1,mod2, mod3) ``` ``` ## Analysis of Variance Table ## ## Model 1: avg_hatecrimes_per_100k_fbi ~ median_household_income ## Model 2: avg_hatecrimes_per_100k_fbi ~ median_household_income + gini_index ## Model 3: avg_hatecrimes_per_100k_fbi ~ median_household_income * gini_index ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 48 129.41 ## 2 47 97.02 1 32.389 19.742 5.539e-05 *** ## 3 46 75.47 1 21.550 13.135 0.0007218 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- ## Interactions! ``` ## ## Call: ## lm(formula = avg_hatecrimes_per_100k_fbi ~ median_household_income * ## gini_index, data = inequality) ## ## Residuals: ## Min 1Q Median 3Q Max ## -2.16383 -0.96279 -0.01053 0.88008 2.75763 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 9.056e+01 3.070e+01 2.950 0.004986 ** ## median_household_income -1.724e-03 4.965e-04 -3.472 0.001136 ** ## gini_index -2.001e+02 6.666e+01 -3.002 0.004330 ** ## median_household_income:gini_index 3.907e-03 1.078e-03 3.624 0.000722 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.281 on 46 degrees of freedom ## (1 observation deleted due to missingness) ## Multiple R-squared: 0.4759, Adjusted R-squared: 0.4417 ## F-statistic: 13.92 on 3 and 46 DF, p-value: 1.367e-06 ``` --- ## Visualisations of the predictions ```r require(interactions) interact_plot(mod3, pred=median_household_income, modx=gini_index, plot.points = T) ``` <img src="Index_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" /> --- ## Visualisations of the predictions ```r p<-ggplot(simulatedD, aes(median_household_income, pred, color=gini_index,frame=gini_index))+geom_point() plotly::ggplotly(p) ```
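---
## Visualisations of the predictions

The data frame `simulatedD` used in the plotly figure above is not defined in these slides; a minimal sketch, assuming it is a prediction grid over the two predictors (the value ranges below are illustrative, not taken from the data):

```r
# Hypothetical construction of a prediction grid like `simulatedD`:
# vary income over a range while holding gini_index at a few fixed values
simulatedD<-expand.grid(median_household_income=seq(35000, 70000, by=1000),
                        gini_index=c(0.42, 0.45, 0.49))
# model predictions for every combination of the two predictors
simulatedD$pred<-predict(mod3, newdata=simulatedD)
```

Holding one predictor at a few fixed values while varying the other is what makes the interaction visible as non-parallel prediction lines.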
--- ## Model diagnostics ```r summary(mod3$residuals) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## -2.16383 -0.96279 -0.01053 0.00000 0.88008 2.75763 ``` Studentized residuals `\(\epsilon_{Ti}=\frac{\epsilon_i}{\sigma_{(-i)}\sqrt{1-h_i}}\)` ??? Studentized residual: the division of a residual by an estimate of its standard deviation, weighted by an estimate of leverage (hat values) <br/> <br/> Delete the observations one at a time and fit the model on the remaining n-1 observations --- class: center, inverse background-image: url("overfit.jpg") --- ## Normality ```r par(bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16) car::qqPlot(mod3) ``` <img src="Index_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" /> ``` ## [1] 9 18 ``` [Cross-validated on residuals](https://stats.stackexchange.com/a/101290) and [Cross-validated shiny](https://xiongge.shinyapps.io/QQplots/) --- ## Homoscedasticity ```r par(bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16) car:: residualPlot(mod3, type='rstudent') ``` <img src="Index_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" /> --- ## Outliers and leverage points ```r par(bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16) car:: influenceIndexPlot(mod3, vars=c("Cook","studentized")) ``` <img src="Index_files/figure-html/unnamed-chunk-36-1.png" style="display: block; margin: auto;" /> ??? Cook's Distance of observation __i__ is defined as the sum of all the changes in the regression model when observation __i__ is removed from it.
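---
## Cook's distance by hand

The leave-one-out logic from the notes can be sketched directly. This is an illustration on simulated toy data (not the lecture's inequality dataset): Cook's distance for observation *i* sums the squared changes in all fitted values when observation *i* is deleted, scaled by the number of coefficients and the residual variance.

```r
set.seed(42)
d<-data.frame(x=rnorm(30)) # simulated toy data
d$y<-2+0.5*d$x+rnorm(30)
m<-lm(y~x, data=d)
p<-length(coef(m)) # number of estimated coefficients
s2<-summary(m)$sigma^2 # estimate of residual variance
m1<-lm(y~x, data=d[-1,]) # refit the model without observation 1
D1<-sum((predict(m, d)-predict(m1, d))^2)/(p*s2) # manual Cook's distance
c(manual=D1, builtin=unname(cooks.distance(m)[1])) # the two values agree
```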
--- ## Outliers and leverage points ```r par(bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16) car:: influenceIndexPlot(mod3, vars=c("Bonf","hat")) ``` <img src="Index_files/figure-html/unnamed-chunk-37-1.png" style="display: block; margin: auto;" /> --- ## Autocorrelation between the residuals ```r par(bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16) stats::acf(resid(mod3)) ``` <img src="Index_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" /> --- ## Observed versus predicted values ```r par(bty='n',mar = c(5, 4, .1, .1), cex=1.1, pch=16) plot(predict(mod3),mod3$model$avg_hatecrimes_per_100k_fbi, xlab='Predicted values (model 3)', ylab='Observed values (avg hatecrimes)') ``` <img src="Index_files/figure-html/unnamed-chunk-39-1.png" style="display: block; margin: auto;" /> ??? [Further reading](https://www.sagepub.com/sites/default/files/upm-binaries/38503_Chapter6.pdf)<br/> <br/> --- ## Model evaluation Rerun the model excluding leverage values .tiny[ ```r mod3.cor<-lm(avg_hatecrimes_per_100k_fbi~median_household_income*gini_index, data=inequality, subset=-9) summary(mod3.cor) ``` ``` ## ## Call: ## lm(formula = avg_hatecrimes_per_100k_fbi ~ median_household_income * ## gini_index, data = inequality, subset = -9) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.90964 -0.94966 -0.08526 0.73470 2.55257 ## ## Coefficients: ## Estimate Std. 
Error t value Pr(>|t|) ## (Intercept) 4.429e+01 3.080e+01 1.438 0.157 ## median_household_income -7.992e-04 5.228e-04 -1.529 0.133 ## gini_index -9.668e+01 6.725e+01 -1.438 0.157 ## median_household_income:gini_index 1.837e-03 1.145e-03 1.604 0.116 ## ## Residual standard error: 1.154 on 45 degrees of freedom ## (1 observation deleted due to missingness) ## Multiple R-squared: 0.1285, Adjusted R-squared: 0.07041 ## F-statistic: 2.212 on 3 and 45 DF, p-value: 0.09974 ``` ] --- ## Model evaluation Main effects model excluding the leverage value .tiny[ ```r mod2.cor<-lm(avg_hatecrimes_per_100k_fbi~median_household_income+gini_index, data=inequality, subset=-9) summary(mod2.cor) ``` ``` ## ## Call: ## lm(formula = avg_hatecrimes_per_100k_fbi ~ median_household_income + ## gini_index, data = inequality, subset = -9) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.86717 -0.97806 -0.05752 1.05394 2.50740 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -4.491e+00 5.015e+00 -0.895 0.3752 ## median_household_income 3.906e-05 2.012e-05 1.942 0.0583 . ## gini_index 1.005e+01 1.005e+01 1.000 0.3226 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 1.174 on 46 degrees of freedom ## (1 observation deleted due to missingness) ## Multiple R-squared: 0.07866, Adjusted R-squared: 0.0386 ## F-statistic: 1.964 on 2 and 46 DF, p-value: 0.152 ``` ] --- ## Take away message (important bits) Linear model - continuous and categorical predictors <br/> <br/> Interpreting estimated coefficients <br/> <br/> Interpreting results (significance, determination coefficient, F-test, residuals) <br/> <br/> Understanding the assumptions of the linear model <br/> <br/> Using linear models in R: overall and stepwise approach <br/> <br/><br/> <br/> Higher proficiency: <br/> <br/> Model diagnostics based on pre- and post-modelling procedures covered by the lecture and beyond --- ## Discussion points 1.
Approach <br/> <br/> We have two options: <br/> <br/> a) We meet every Friday, where I present the lecture (recorded) - the major gain is that we can always stop the presentation and discuss the content <br/> b) I pre-record the lecture and we use this time slot to discuss it - the major gain is more time to discuss (workshop-type approach), while the major drawback is that you need to watch the lecture before we meet <br/> <br/> 2. Based on today's lecture (take into account that I did not go in depth into linear regression), would you prefer more:<br/><br/> a) Theory - mathematical aspects + explanations <br/> b) Examples - a range of different problems <br/> c) Practice - more code and focus on understanding the data <br/> d) Stay the same, never change <br/> <br/> --- ## Literature The first and second chapters of "Data Analysis Using Regression and Multilevel/Hierarchical Models" by Andrew Gelman and Jennifer Hill <br/> <br/> The first three chapters of "Regression Analysis and Linear Models: Concepts, Applications, and Implementation" by Richard B. Darlington and Andrew F. Hayes <br/> <br/> van Rij, J., Vaci, N., Wurm, L. H., & Feldman, L. B. (2020). Alternative quantitative methods in psycholinguistics: Implications for theory and design. Word Knowledge and Word Usage, 83-126. --- ## Exercise 1. Find a dataset with interesting measures and load it into R [FiveThirtyEight](https://data.fivethirtyeight.com/), [Kaggle](https://www.kaggle.com/) <br/><br/> 2. Make a histogram and a density plot <br/> <br/> 3. Calculate correlations between two or more variables <br/><br/> 4. If there is published work that used the dataset, try to repeat the analysis (only a few variables and only if it is a linear regression) <br/><br/> 5. Make a model where you have high or perfect multicollinearity between two predictors (you can simulate the data) <br/><br/> 6. Make a model where you have `\(R^2=1\)` --- # Thank you for your attention
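---
## Appendix: hints for exercises 5 and 6

A minimal sketch with simulated data (all variable names below are made up for illustration): perfect multicollinearity makes one coefficient inestimable, and an outcome that is an exact linear function of a predictor gives `\(R^2=1\)`.

```r
set.seed(1)
x1<-rnorm(100)
x2<-2*x1 # x2 is a perfect linear function of x1
y<-1+x1+rnorm(100)
# perfect multicollinearity: R reports an NA (aliased) coefficient for x2
coef(lm(y~x1+x2))
# exact linear relation between outcome and predictor: R^2 equals 1
summary(lm(x2~x1))$r.squared
```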