Introduction To Linear Regression Analysis Montgomery 347.pdf
Chapter Outline
1.0 Introduction
1.1 A First Regression Analysis
1.2 Examining Data
1.3 Simple Linear Regression
1.4 Multiple Regression
1.5 Transforming Variables
1.6 Summary
1.7 For More Information
This web book is composed of four chapters covering a variety of topics about using SAS for regression. We should emphasize that this book is about "data analysis" and that it demonstrates how SAS can be used for regression analysis, as opposed to a book that covers the statistical basis of multiple regression. We assume that you have had at least one statistics course covering regression analysis and that you have a regression book that you can use as a reference (see the Regression With SAS page and our Statistics Books for Loan page for recommended regression analysis books). This book is designed to help you apply your knowledge of regression, combining it with instruction on SAS, to perform, understand, and interpret regression analyses.
This first chapter will cover topics in simple and multiple regression, as well as the supporting tasks that are important in preparing to analyze your data, e.g., data checking, getting familiar with your data file, and examining the distribution of your variables. We will illustrate the basics of simple and multiple regression and demonstrate the importance of inspecting, checking, and verifying your data before accepting the results of your analysis. In general, we hope to show that the results of your regression analysis can be misleading without further probing of your data, which could reveal relationships that a casual analysis would overlook.
We will not go into all of the details of this output. Note that there are 400 observations and 21 variables. We have variables about academic performance in 2000 and 1999 and the change in performance: api00, api99, and growth, respectively. We also have various characteristics of the schools, e.g., class size, parents' education, percent of teachers with full and emergency credentials, and number of students. Note that when we did our original regression analysis it said that there were 313 observations, but the proc contents output indicates that we have 400 observations in the data file.
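As a sketch of the step described above, proc contents lists the number of observations, the number of variables, and the attributes of each variable (here we assume the data set is named elemapi2; substitute the name of your own data file):

```
/* List the number of observations, number of variables,
   and the attributes of each variable in the data file */
proc contents data=elemapi2;
run;
```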
We see that among the first 10 observations, we have four missing values for meals. It is likely that the missing data for meals had something to do with the fact that the number of observations in our first regression analysis was 313 and not 400.
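One way to confirm this (a sketch, assuming the data set and variable names used in this chapter) is to count the nonmissing and missing values for each variable with proc means, since observations with a missing value on any variable in the model are dropped from the regression:

```
/* Count nonmissing (N) and missing (NMiss) values
   to see how many observations would be lost casewise */
proc means data=elemapi2 n nmiss;
  var api00 acs_k3 meals full;
run;
```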
Another kind of graph that you might want to make is a residual versus fitted plot. As shown below, we can use the plot statement to make this graph. The keywords residual. and predicted. in this context refer to the residual value and predicted value from the regression analysis and can be abbreviated as r. and p.
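A minimal sketch of this graph, using the plot statement within proc reg (the model shown here, predicting api00 from meals, acs_k3, and full, is an assumed illustration):

```
/* Fit the model and plot residuals against fitted values;
   residual. (r.) and predicted. (p.) are computed from the fit */
proc reg data=elemapi2;
  model api00 = meals acs_k3 full;
  plot residual.*predicted.;
run;
quit;
```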
Finally, as part of doing a multiple regression analysis you might be interested in seeing the correlations among the variables in the regression model. You can do this with proc corr as shown below.
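A sketch of this step (the variable names are carried over from the illustrative model above) simply lists the model variables on the var statement:

```
/* Pairwise correlations among the variables in the model */
proc corr data=elemapi2;
  var api00 meals acs_k3 full;
run;
```

By default, proc corr reports the Pearson correlation for each pair along with the number of observations and a p-value.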
Earlier we focused on screening your data for potential errors. In the next chapter, we will focus on regression diagnostics to verify whether your data meet the assumptions of linear regression. Here, we will focus on the issue of normality. Some researchers believe that linear regression requires that the outcome (dependent) and predictor variables be normally distributed. We need to clarify this issue. In actuality, it is the residuals that need to be normally distributed. In fact, the residuals need to be normal only for the t-tests to be valid. The estimation of the regression coefficients does not require normally distributed residuals. As we are interested in having valid t-tests, we will investigate issues concerning normality.
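One way to examine the residuals directly (a sketch, assuming the illustrative model and data set names used earlier) is to save them with an output statement and then check them with proc univariate:

```
/* Save the residuals from the regression into a new data set */
proc reg data=elemapi2;
  model api00 = meals acs_k3 full;
  output out=resids r=resid;
run;
quit;

/* Test the residuals for normality and examine a Q-Q plot */
proc univariate data=resids normal;
  var resid;
  qqplot resid / normal(mu=est sigma=est);
run;
```

The normal option produces formal tests of normality (e.g., Shapiro-Wilk), while the Q-Q plot lets you judge departures from normality visually.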
In this lecture we have discussed the basics of how to perform simple and multiple regressions, the basics of interpreting output, as well as some related commands. We examined some tools and techniques for screening for bad data and the consequences such data can have on your results. Finally, we touched on the assumptions of linear regression and illustrated how you can check the normality of your variables and how you can transform your variables to achieve normality. The next chapter will pick up where this chapter has left off, going into a more thorough discussion of the assumptions of linear regression and how you can use SAS to assess these assumptions for your data. In particular, the next lecture will address the following issues.
a, Retrospective duration estimates (minutes) as a function of veridical clock duration (minutes) during lockdown (S1; pink) and outside of it (SC; grey). Each dot represents a single participant. The regression lines were estimated from the linear mixed-effects model; their 95% CIs are shown with grey shading. b, Relative retrospective duration estimates (unitless) as a function of the stringency index (a.u. between 0 and 100) for all sessions (coloured). The coloured dots are individual data points per participant and per session. The regression line was estimated from the linear model; the 95% CI is shown with grey shading. The more stringent governmental rules were, the shorter retrospective durations were estimated to be. c, Relative retrospective duration estimates (unitless) as a function of the mobility index (percent change relative to baseline, prior to lockdown; see the main text) for all sessions (coloured). Each dot is an individual data point per participant and per session. The black line is a regression line estimated from the linear model; the 95% CI is shown with grey shading. The closer to baseline mobility, the shorter retrospective durations were estimated to be.
a, Distribution of VAS rating (0 to 100) counts for passage-of-time judgements as a function of session (colour coded). b, Passage-of-time ratings as a function of subjective confinement (5 to 20). The grey dots are individual data points (per participant, per session, per run). The black dots are the mean passage-of-time ratings binned by subjective confinement. Their size scales with the underlying number of individual data points. The black line is a regression line estimated from the linear mixed-effects model; the 95% CI is shown with grey shading. The less lonely the participants felt, the faster the passage of time felt.