Using any data of interest to your group, compile a data set comprised of one predictor and one
response variable with at least 20 observations (data points), and answer the questions below.
Your project needs to be typed and plots can be made using any software of your choice. Only one
project (with each member's name) per group needs to be submitted. Your project should include
all the observations used.
Provide a brief description of your project. Make sure to identify the predictor and response
variables, and discuss the objective of your regression model.
1. (20%) All your answers must be in the order in which the questions are asked, otherwise you will be
deducted 20%. Note: Even if only one answer is out of order you will still be deducted 20%.
2. (15%) For your predictor and response variables:
(a) compute the range and IQR.
(b) make a histogram of your data.
(c) make a boxplot of your data.
3. (25%) Make a scatterplot of your data and describe the:
(a) Direction
(b) Form
(c) Strength
(d) Correlation
(g) Outliers
4. (40%) Based on your data, construct a linear regression model of your response variable as a function
of your predictor variable following the steps below:
(a) Compute x̄ and ȳ
(b) Compute s_x and s_y
(c) Compute r
(d) Compute a and b
(e) Construct the respective Least Squares line and plot it over your scatter plot.
(f) Compute the respective R² and interpret your results.
(g) For your model, compute and plot the residuals vs x. Describe what you observe from
this plot.
(h) Are there any outliers? If so, are they high leverage and/or influential?
(i) Based on your model, make 3 predictions for your response variable (i.e., use 3 different
values of x that are not in your data, and compute the respective y value).
Question One
The data below were obtained from an organization that wanted to estimate the cost of leasing a building given the contract value for constructing the building. Accordingly, the contract value is the predictor variable and the estimated cost is the response variable. The objective of the regression model is to predict the estimated cost from the contract value.
Estimated cost | Contract value |
85,000 | 100,000 |
70,000 | 120,000 |
110,000 | 150,000 |
90,000 | 80,000 |
130,000 | 180,000 |
160,000 | 190,000 |
160,000 | 200,000 |
280,000 | 350,000 |
130,000 | 180,000 |
320,000 | 380,000 |
310,000 | 360,000 |
305,000 | 370,000 |
180,000 | 200,000 |
170,000 | 250,000 |
160,000 | 300,000 |
110,000 | 160,000 |
150,000 | 210,000 |
180,000 | 230,000 |
175,000 | 250,000 |
180,000 | 270,000 |
Question Two
(a) Compute the range and IQR.
Range
Contract value = 380,000 - 80,000
= 300,000
Estimated cost = 320,000 - 70,000
= 250,000
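Interquartile range (IQR = Q3 - Q1, with quartiles found by linear interpolation between the ordered observations)
Contract value = 277,500 - 175,000
= 102,500
Estimated cost = 180,000 - 125,000
= 55,000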
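The range and IQR can be checked with a short script. This is a minimal sketch, assuming the 20 observations listed in Question One and assuming quartiles are found by linear interpolation between ordered observations (the "inclusive" method of Python's statistics.quantiles). Other quartile conventions give somewhat different IQR values. The names contract_value, estimated_cost, value_range and iqr are illustrative only.

```python
import statistics

# Observations from Question One, entered in thousands and scaled to full units.
contract_value = [v * 1000 for v in (100, 120, 150, 80, 180, 190, 200, 350, 180, 380,
                                     360, 370, 200, 250, 300, 160, 210, 230, 250, 270)]
estimated_cost = [v * 1000 for v in (85, 70, 110, 90, 130, 160, 160, 280, 130, 320,
                                     310, 305, 180, 170, 160, 110, 150, 180, 175, 180)]


def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)


def iqr(values):
    """IQR = Q3 - Q1, with quartiles interpolated linearly ("inclusive" method)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


for name, values in (("Contract value", contract_value),
                     ("Estimated cost", estimated_cost)):
    print(f"{name}: range = {value_range(values):,.0f}, IQR = {iqr(values):,.0f}")
```

Switching method to "exclusive" shows how much the quartiles depend on the convention chosen.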
(b) Make a histogram of your data.
(c) Make a boxplot of your data.
Question Three
(a) Direction
The direction of a relationship tells whether the values of the two variables rise and fall together. If two variables have a positive direction, then as the values of one variable increase, so do the values of the other. The data have a positive direction because the points in the scatter plot run from the lower left to the upper right. This implies that as the contract value increases, the estimated cost also increases.
(b) Form
The form of a scatter plot describes the shape of the pattern that the points follow, which may be curved or straight. If the relationship is linear, the points cluster around a straight line in a consistent band. In the scatter plot, the points follow a straight and consistent pattern, i.e., there is a linear relationship between the estimated cost and the contract value.
(c) Strength
The strength of the relationship between the variables is determined by how closely the plotted points cluster around the underlying pattern (here, a straight line). Tightly clustered points indicate a strong relationship. In this case, the points are neither tightly clustered nor widely scattered. Therefore, the relationship appears moderate.
(d) Correlation
The correlation between two variables measures the strength and direction of their linear relationship. The direction has already been established as positive, and the scatter plot suggests a relationship of at least moderate strength. The correlation coefficient computed in Question Four (r = 0.944) indicates that the positive linear relationship is in fact strong.
(g) Outliers
Outliers are points that lie far from the bulk of the data in a scatter plot. In this case, there are four outliers, which the box plot also shows.
Question Four
(a) Compute x̄ and ȳ
The mean of the contract value (the predictor, x) is the sum of all the observations divided by the number of observations.
x̄ = 4,530,000/20
= 226,500
The mean of the estimated cost (the response, y) is the sum of all the observations divided by the number of observations.
ȳ = 3,455,000/20
= 172,750
(b) Compute s_x and s_y
The standard deviation of each variable is the square root of the sum of the squared deviations from that variable's own mean, divided by the number of observations less one.
The standard deviation of the contract value (x) is
s_x = (152,055,000,000/19)^(1/2)
= 89,458.90
The standard deviation of the estimated cost (y) is
s_y = (107,323,750,000/19)^(1/2)
= 75,157.2912
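The means and sample standard deviations can be checked directly from the observations. The following is a minimal sketch, assuming the data from Question One, with contract value as x and estimated cost as y. Each variable's deviations are taken from its own mean, and the sum of squares is divided by n - 1. The helper names are illustrative only.

```python
import math

# Observations from Question One, entered in thousands and scaled to full units.
contract_value = [v * 1000 for v in (100, 120, 150, 80, 180, 190, 200, 350, 180, 380,
                                     360, 370, 200, 250, 300, 160, 210, 230, 250, 270)]
estimated_cost = [v * 1000 for v in (85, 70, 110, 90, 130, 160, 160, 280, 130, 320,
                                     310, 305, 180, 170, 160, 110, 150, 180, 175, 180)]


def mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation with the n - 1 divisor."""
    m = mean(values)  # deviations are measured from this variable's own mean
    sum_sq_dev = sum((v - m) ** 2 for v in values)
    return math.sqrt(sum_sq_dev / (len(values) - 1))


print(f"x-bar = {mean(contract_value):,.2f}, s_x = {sample_sd(contract_value):,.4f}")
print(f"y-bar = {mean(estimated_cost):,.2f}, s_y = {sample_sd(estimated_cost):,.4f}")
```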
(c) Compute r
The correlation coefficient is given by the following formula, where X is the contract value, Y is the estimated cost and n = 20.
r = [nΣXY - (ΣX)(ΣY)] / sqrt{[nΣX² - (ΣX)²][nΣY² - (ΣY)²]}
Estimated cost (Y) | Contract value (X) | XY | Y² | X² |
85,000 | 100,000 | 8,500,000,000 | 7,225,000,000 | 10,000,000,000 |
70,000 | 120,000 | 8,400,000,000 | 4,900,000,000 | 14,400,000,000 |
110,000 | 150,000 | 16,500,000,000 | 12,100,000,000 | 22,500,000,000 |
90,000 | 80,000 | 7,200,000,000 | 8,100,000,000 | 6,400,000,000 |
130,000 | 180,000 | 23,400,000,000 | 16,900,000,000 | 32,400,000,000 |
160,000 | 190,000 | 30,400,000,000 | 25,600,000,000 | 36,100,000,000 |
160,000 | 200,000 | 32,000,000,000 | 25,600,000,000 | 40,000,000,000 |
280,000 | 350,000 | 98,000,000,000 | 78,400,000,000 | 122,500,000,000 |
130,000 | 180,000 | 23,400,000,000 | 16,900,000,000 | 32,400,000,000 |
320,000 | 380,000 | 121,600,000,000 | 102,400,000,000 | 144,400,000,000 |
310,000 | 360,000 | 111,600,000,000 | 96,100,000,000 | 129,600,000,000 |
305,000 | 370,000 | 112,850,000,000 | 93,025,000,000 | 136,900,000,000 |
180,000 | 200,000 | 36,000,000,000 | 32,400,000,000 | 40,000,000,000 |
170,000 | 250,000 | 42,500,000,000 | 28,900,000,000 | 62,500,000,000 |
160,000 | 300,000 | 48,000,000,000 | 25,600,000,000 | 90,000,000,000 |
110,000 | 160,000 | 17,600,000,000 | 12,100,000,000 | 25,600,000,000 |
150,000 | 210,000 | 31,500,000,000 | 22,500,000,000 | 44,100,000,000 |
180,000 | 230,000 | 41,400,000,000 | 32,400,000,000 | 52,900,000,000 |
175,000 | 250,000 | 43,750,000,000 | 30,625,000,000 | 62,500,000,000 |
180,000 | 270,000 | 48,600,000,000 | 32,400,000,000 | 72,900,000,000 |
Total: 3,455,000 | 4,530,000 | 903,200,000,000 | 704,175,000,000 | 1,178,100,000,000 |
r = [20(903,200,000,000) - (4,530,000)(3,455,000)] / sqrt{[20(1,178,100,000,000) - (4,530,000)²][20(704,175,000,000) - (3,455,000)²]}
= 0.94439147
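The column totals and the resulting correlation coefficient can be reproduced as follows. This is a minimal sketch, assuming the computational formula r = [nΣXY - ΣXΣY] / sqrt{[nΣX² - (ΣX)²][nΣY² - (ΣY)²]}, with contract value as X and estimated cost as Y. The variable names are illustrative only.

```python
import math

# Observations from Question One, entered in thousands and scaled to full units.
contract_value = [v * 1000 for v in (100, 120, 150, 80, 180, 190, 200, 350, 180, 380,
                                     360, 370, 200, 250, 300, 160, 210, 230, 250, 270)]
estimated_cost = [v * 1000 for v in (85, 70, 110, 90, 130, 160, 160, 280, 130, 320,
                                     310, 305, 180, 170, 160, 110, 150, 180, 175, 180)]

n = len(contract_value)
sum_x = sum(contract_value)
sum_y = sum(estimated_cost)
sum_xy = sum(x * y for x, y in zip(contract_value, estimated_cost))
sum_x2 = sum(x * x for x in contract_value)
sum_y2 = sum(y * y for y in estimated_cost)

# Pearson correlation from the raw sums (the totals row of the table).
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(f"Sum XY = {sum_xy:,}, Sum X^2 = {sum_x2:,}, Sum Y^2 = {sum_y2:,}")
print(f"r = {r:.8f}")
```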
(d) Compute a and b
a (intercept) = -6958.173
b (slope) = 0.793
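One way to obtain these coefficients is sketched below, assuming the usual least-squares formulas b = Sxy/Sxx and a = ȳ - b·x̄, where Sxy is the sum of cross-products of deviations and Sxx is the sum of squared deviations of x. Contract value is taken as x and estimated cost as y. The function name fit_line is hypothetical.

```python
# Observations from Question One, entered in thousands and scaled to full units.
contract_value = [v * 1000 for v in (100, 120, 150, 80, 180, 190, 200, 350, 180, 380,
                                     360, 370, 200, 250, 300, 160, 210, 230, 250, 270)]
estimated_cost = [v * 1000 for v in (85, 70, 110, 90, 130, 160, 160, 280, 130, 320,
                                     310, 305, 180, 170, 160, 110, 150, 180, 175, 180)]


def fit_line(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x-bar, y-bar)
    return a, b


a, b = fit_line(contract_value, estimated_cost)
print(f"a = {a:.3f}, b = {b:.5f}")
```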
(e) Construct the respective Least Squares line and plot it over your scatter plot.
Estimated cost = -6958.173 + 0.793 × (contract value)
(f) Compute the respective R² and interpret your results.
R² = r² = (0.94439147)² = 0.89187525
This implies that about 89 percent of the variation in estimated cost is explained by the linear relationship with contract value.
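R² can also be computed from the fitted line as 1 - SSE/SST, which for a simple linear regression equals r². The sketch below assumes the least-squares fit of estimated cost on contract value. The names fit_line, sse and sst are illustrative only.

```python
# Observations from Question One, entered in thousands and scaled to full units.
contract_value = [v * 1000 for v in (100, 120, 150, 80, 180, 190, 200, 350, 180, 380,
                                     360, 370, 200, 250, 300, 160, 210, 230, 250, 270)]
estimated_cost = [v * 1000 for v in (85, 70, 110, 90, 130, 160, 160, 280, 130, 320,
                                     310, 305, 180, 170, 160, 110, 150, 180, 175, 180)]


def fit_line(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


a, b = fit_line(contract_value, estimated_cost)
y_bar = sum(estimated_cost) / len(estimated_cost)

# SSE: variation left unexplained by the line; SST: total variation in y.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(contract_value, estimated_cost))
sst = sum((y - y_bar) ** 2 for y in estimated_cost)
r_squared = 1 - sse / sst

print(f"R^2 = {r_squared:.8f}")
```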
(g) For your model, compute and plot the residuals vs. x. Describe what you observe from this plot.
The residual plot above indicates that the residuals have roughly constant variance and show no systematic pattern, because their spread is consistent regardless of the contract value. The normal probability plot below also suggests that the residuals are approximately normally distributed.
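The residuals behind such a plot can be tabulated as follows. This is a minimal sketch, assuming the least-squares fit of estimated cost on contract value. Each residual is the observed estimated cost minus the fitted value, listed in order of contract value so that any pattern against x is easy to spot. The function name fit_line is hypothetical.

```python
# Observations from Question One, entered in thousands and scaled to full units.
contract_value = [v * 1000 for v in (100, 120, 150, 80, 180, 190, 200, 350, 180, 380,
                                     360, 370, 200, 250, 300, 160, 210, 230, 250, 270)]
estimated_cost = [v * 1000 for v in (85, 70, 110, 90, 130, 160, 160, 280, 130, 320,
                                     310, 305, 180, 170, 160, 110, 150, 180, 175, 180)]


def fit_line(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


a, b = fit_line(contract_value, estimated_cost)

# Residual = observed y minus fitted y, kept in the original observation order.
residuals = [y - (a + b * x) for x, y in zip(contract_value, estimated_cost)]

print(f"{'contract value':>15} {'residual':>12}")
for x, e in sorted(zip(contract_value, residuals)):
    print(f"{x:>15,} {e:>12,.1f}")
```

Least-squares residuals sum to zero (up to rounding), which is a quick sanity check on the fit.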
(h) Are there any outliers? If so, are they high leverage and/or influential?
There are outliers in the data, but they are neither high leverage nor influential.
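Leverage and influence can be quantified with the standard simple-regression diagnostics. The sketch below assumes leverage h_i = 1/n + (x_i - x̄)²/Sxx and Cook's distance D_i = [e_i²/(p·MSE)]·h_i/(1 - h_i)², with p = 2 estimated coefficients. Common rules of thumb, which are conventions and not part of the original analysis, compare leverage with 2p/n or 3p/n and Cook's distance with 1. All names are illustrative only.

```python
# Observations from Question One, entered in thousands and scaled to full units.
contract_value = [v * 1000 for v in (100, 120, 150, 80, 180, 190, 200, 350, 180, 380,
                                     360, 370, 200, 250, 300, 160, 210, 230, 250, 270)]
estimated_cost = [v * 1000 for v in (85, 70, 110, 90, 130, 160, 160, 280, 130, 320,
                                     310, 305, 180, 170, 160, 110, 150, 180, 175, 180)]

n, p = len(contract_value), 2  # p = number of coefficients (intercept and slope)
x_bar = sum(contract_value) / n
y_bar = sum(estimated_cost) / n
s_xx = sum((x - x_bar) ** 2 for x in contract_value)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(contract_value, estimated_cost))
b = s_xy / s_xx
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(contract_value, estimated_cost)]
mse = sum(e * e for e in residuals) / (n - p)  # residual mean square

# Leverage depends only on how far x_i lies from the mean of x.
leverage = [1 / n + (x - x_bar) ** 2 / s_xx for x in contract_value]
# Cook's distance combines residual size with leverage.
cooks_d = [(e * e / (p * mse)) * h / (1 - h) ** 2 for e, h in zip(residuals, leverage)]

print(f"{'x':>9} {'y':>9} {'leverage':>9} {'Cook D':>8}")
for x, y, h, d in zip(contract_value, estimated_cost, leverage, cooks_d):
    print(f"{x:>9,} {y:>9,} {h:>9.3f} {d:>8.3f}")
```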
(i) Based on your model, make 3 predictions for your response variable.
Using the equation Estimated cost = -6958.173 + 0.793 × (contract value), with the unrounded slope b = 0.79341,
the predicted values for three contract values that are not in the data are shown in the table below.
Contract value | 276,000 | 302,000 | 144,000 |
Predicted estimated cost | 212,023.9716 | 232,652.7243 | 107,293.3807 |
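The predictions can be reproduced by refitting the line and evaluating it at the three new contract values. This is a minimal sketch, assuming the least-squares fit of estimated cost on contract value with unrounded coefficients. Using the rounded slope 0.793 instead would shift each prediction by roughly a hundred units. The names fit_line and predict are hypothetical.

```python
# Observations from Question One, entered in thousands and scaled to full units.
contract_value = [v * 1000 for v in (100, 120, 150, 80, 180, 190, 200, 350, 180, 380,
                                     360, 370, 200, 250, 300, 160, 210, 230, 250, 270)]
estimated_cost = [v * 1000 for v in (85, 70, 110, 90, 130, 160, 160, 280, 130, 320,
                                     310, 305, 180, 170, 160, 110, 150, 180, 175, 180)]


def fit_line(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


a, b = fit_line(contract_value, estimated_cost)


def predict(new_contract_value):
    """Predicted estimated cost for a contract value, using unrounded a and b."""
    return a + b * new_contract_value


for x_new in (276_000, 302_000, 144_000):
    # Only predict for x values that are not already in the data set.
    assert x_new not in contract_value
    print(f"{x_new:,} -> {predict(x_new):,.4f}")
```

All three contract values lie inside the observed range of 80,000 to 380,000, so these are interpolations and not extrapolations.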