# DAX 4 3345 Troy University

Please watch the videos for detailed instructions.

Filtered Dataset Video

Running Correlation

Interpreting Correlation

Running the Regression

Interpreting the Regression

The data for this analysis come from our labor dataset, and the analysis we will perform is a linear regression. Before running it, filter the dataset to include both Position 1 and Position 2 employees.

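The filtering step might look like the following in Python. This is only a sketch: the rows are made up, and the field names `employee`, `position`, and `rate_per_hr` are assumptions, since the real column names of the labor dataset are not given here.

```python
# Hypothetical stand-in for the labor dataset. The field names
# ("employee", "position", "rate_per_hr") are assumptions for illustration.
labor = [
    {"employee": "A", "position": 1, "rate_per_hr": 18.50},
    {"employee": "B", "position": 2, "rate_per_hr": 22.00},
    {"employee": "C", "position": 3, "rate_per_hr": 31.25},
    {"employee": "D", "position": 1, "rate_per_hr": 17.75},
    {"employee": "E", "position": 4, "rate_per_hr": 40.00},
]


def filter_positions(rows, keep=(1, 2)):
    """Return only the rows whose position code is in `keep`."""
    return [row for row in rows if row["position"] in keep]


filtered = filter_positions(labor)
print(f"kept {len(filtered)} of {len(labor)} rows")
```

The same idea applies in SAS or Excel: keep the rows where the position is 1 or 2 and run every later step on that subset only.
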
We have explored the dataset and developed a question: What factors influence \$/hr for Position 1 and 2 employees? To answer it, I am going to analyze the dataset and perform a linear regression (LR) using \$/hr as the dependent variable (DV) and other employee variables as independent variables (IV). First, a brief refresher on linear regression. This is only a simplified review; for a better understanding, please refer to DAL 5 and review other statistical methods and linear regression materials as needed.

A linear regression is a way to model the statistical relationship between a response (or dependent) variable and one or more explanatory (or independent) variables. With a single IV, the linear relationship between the two variables may be represented by a straight line, often called the regression line. Simply put, we want to see whether the IV value can be used to accurately predict the DV value. Often, we can see whether there is a relationship between the IV and DV by creating a scatterplot of the data. See the following figures for examples of a positive, neutral, and negative relationship/correlation.

In Figure 1 we can see that as the independent variable increases on the horizontal axis from left to right, the dependent variable tends to increase in value, although the increase is not consistent because of error or other explanatory variables not used in our model. Based on this scatterplot, we might suspect a strong positive relationship between the IV and DV, but we would need to perform a linear regression to analyze the strength of the relationship and identify the parameters (the constant and the slope) of the regression line. The scatterplot also gives a visual check of the linearity of the relationship.

[Scatterplot "Positive Relationship"; horizontal axis: Independent Variable, 0 to 30; vertical axis: 0 to 60]

Figure 1. Positive relationship between IV and DV

In Figure 2 we can see that as the independent variable increases on the horizontal axis from left to right, the value of the dependent variable seems to vary randomly above and below the regression line. The slope of the regression line might be close to zero. This would indicate that for any value of the independent variable, the value of the dependent variable is equal to the constant plus some random error value that we do not know. There may be other explanatory variables (IV) that have a stronger relationship with the response variable (DV).

[Scatterplot "Neutral Relationship"; horizontal axis: Independent Variable, 0 to 30; vertical axis: 0 to 40]

Figure 2. Neutral relationship between IV and DV

In Figure 3 we can see that as the independent variable increases on the horizontal axis from left to right, the dependent variable tends to decrease in value, although the decrease is not consistent because of error or other explanatory variables not used in our model. Based on this scatterplot, we might suspect a strong negative relationship between the IV and DV, but we would need to perform a linear regression to analyze the strength of the relationship and identify the parameters (the constant and the slope) of the regression line.

[Scatterplot "Negative Relationship"; horizontal axis: Independent Variable, 0 to 30; vertical axis: 0 to 60]

Figure 3. Negative relationship between IV and DV

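To put a number on what the three scatterplots show, the Pearson correlation coefficient can be computed by hand. The sketch below uses made-up values (not the data behind the figures) on the same 0 to 30 horizontal scale; `pearson_r` is a helper written for this illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up values on the same 0-30 horizontal scale as the figures.
iv = [0, 5, 10, 15, 20, 25, 30]
dv_positive = [5, 14, 22, 28, 41, 47, 58]
dv_neutral = [20, 24, 18, 23, 19, 22, 20]
dv_negative = list(reversed(dv_positive))

r_positive = pearson_r(iv, dv_positive)
r_neutral = pearson_r(iv, dv_neutral)
r_negative = pearson_r(iv, dv_negative)

for label, r in [("positive", r_positive),
                 ("neutral", r_neutral),
                 ("negative", r_negative)]:
    print(f"{label:>8}: r = {r:+.3f}")
```

A coefficient near +1 or -1 matches the strong positive and negative patterns, while a value near 0 matches the neutral pattern.
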
Again, a linear regression is a way to model a hypothesized statistical relationship between a predictor variable (IV) and a response variable (DV). What is the difference between a deterministic relationship and a statistical relationship?

In a deterministic relationship, an equation exactly describes the relationship between two or more variables. Examples are the relationship between Fahrenheit and Celsius (°F = 9/5 × °C + 32) and the relationship between circumference and diameter (Circumference = π × diameter). In a deterministic relationship there are no error terms to consider, so we can model the relationship with a simple linear equation, as depicted in Figure 4. Such a relationship has an R-squared of 1.

Y value = constant + (slope × X value)

Figure 4. Deterministic linear equation

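As a quick check of the deterministic case, the sketch below generates Fahrenheit values from the conversion formula, fits a line to them, and confirms that the fit recovers the constant (32), the slope (9/5), and an R-squared of 1. The helper names `fit_line` and `r_squared` are written for this illustration only.

```python
import math


def fit_line(xs, ys):
    """Least-squares constant (intercept) and slope for y = constant + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(xs, ys, constant, slope):
    """Share of the variation in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - (constant + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst


celsius = [-40, 0, 25, 37, 100]
fahrenheit = [9 / 5 * c + 32 for c in celsius]  # exact, no error term

constant, slope = fit_line(celsius, fahrenheit)
r2 = r_squared(celsius, fahrenheit, constant, slope)
print(f"constant = {constant:.2f}, slope = {slope:.2f}, R-squared = {r2:.4f}")
```

Because every point lies exactly on the line, there is nothing left over for an error term to explain.
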
In a statistical linear relationship, there is a trend (positive, neutral, or negative), plus a constant, plus some error that we see as the “scatter” in a scatterplot. So, we must modify our linear equation to find the line that best “describes” the relationship between the predictor variable and the response variable. Data analysis software like SAS and Excel does this by choosing the intercept and slope of the line that minimize the sum of all the squared errors (the differences between predicted and observed responses).

$\hat{y}_i$ = the predicted response (or fitted value)

$b_0$ = the estimated Y axis intercept of the best fitting line

$b_1$ = the estimated slope of the best fitting line

$x_i$ = the predictor variable value (IV value)

$y_i$ = the observed response value (DV value)

$\beta_0$ = the population regression line constant (estimated by $b_0$)

$\beta_1$ = the population regression line slope (estimated by $b_1$)

$\varepsilon_i$ = the error term (the difference between $y_i$ and the population regression line); its sample estimates, $y_i - \hat{y}_i$, are the residuals

$\hat{y}_i = b_0 + b_1 x_i$

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

Figure 5. Statistical linear equation

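The least-squares estimates have a closed form, so the "best fitting line" can be computed directly. The sketch below uses a small made-up sample (years of experience as a hypothetical IV, \$/hr as the DV) and shows that moving either estimate away from the least-squares solution increases the sum of squared errors.

```python
import math

# Hypothetical sample: years of experience (IV) and $/hr (DV).
x = [1, 2, 3, 4, 5]
y = [12.1, 13.9, 16.2, 17.8, 20.1]


def least_squares(xs, ys):
    """Return (b0, b1), the intercept and slope that minimize squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def sum_squared_errors(xs, ys, b0, b1):
    """Sum of squared differences between observed and predicted responses."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(xs, ys))


b0, b1 = least_squares(x, y)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
best_sse = sum_squared_errors(x, y, b0, b1)

# Nudging either parameter away from the least-squares solution raises the SSE.
worse_slope = sum_squared_errors(x, y, b0, b1 + 0.1)
worse_constant = sum_squared_errors(x, y, b0 + 0.1, b1)

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, SSE = {best_sse:.4f}")
print(f"SSE with slope + 0.1: {worse_slope:.4f}")
print(f"SSE with constant + 0.1: {worse_constant:.4f}")
```

The residuals of a least-squares line with an intercept always sum to zero, which is a handy sanity check on any fit.
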
So, as we can see in the equations in Figure 5, the statistical linear relationship only approximately describes the relationship between the predictor value and the response value, instead of the exact relationship described in a deterministic linear equation. If $\beta_1 = 0$, the predictor tells us nothing about the response, so we need to determine whether $\beta_1$ is different from zero ($\beta_1 \neq 0$).

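For reference, the standard way to test whether the slope differs from zero in a simple linear regression with $n$ observations is sketched below. The symbols $SE(b_1)$, $MSE$, and $\bar{x}$ (the sample mean of the predictor) are not defined elsewhere in this review and are introduced here only for illustration.

$$
t = \frac{b_1}{SE(b_1)}, \qquad
SE(b_1) = \sqrt{\frac{MSE}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}, \qquad
MSE = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}
$$

Under the null hypothesis $\beta_1 = 0$, $t$ follows a t distribution with $n - 2$ degrees of freedom. With a single predictor, the F-Value reported by regression software equals $t^2$, so the t test on the slope and the overall F test give the same p-value.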
In testing the null hypothesis for a simple linear regression, we should generally follow these steps:

1. State the plain language research question, e.g. What factors influence \$/hr for Position 1 and 2 employees?
2. State the hypotheses:
   • Null hypothesis ($H_0$): $\beta_{\text{Performance}} = 0$
   • Alternative hypothesis ($H_A$): $\beta_{\text{Performance}} \neq 0$
3. State the criteria for rejecting $H_0$:
   • $\alpha = 0.05$
4. Consider the assumptions for linear regression:
   • Assumption that there is a linear relationship between the response variable and the predictor variable. (Use scatter plots of each continuous independent variable against the dependent variable.)
   • Assumption that the errors, $\varepsilon_i$, are independent (research design).
      i. Is your data a “snap shot” or a “video” of your observations? If your data is more of a “video”, consider a time series analysis.
      ii. Non-significant Chi Square
   • Assumption that the errors, $\varepsilon_i$, at each value of the predictor, $x_i$, are normally distributed (not skewed, with a mean of zero). A non-significant Shapiro-Wilk statistic indicates normally distributed error terms. (Examine the residual plots.)
   • Assumption that the errors, $\varepsilon_i$, at each value of the predictor, $x_i$, have equal variances ($\sigma^2$).
      i. No triangular-looking patterns between the response variable and the standardized residuals, and
      ii. Non-significant Chi Square
   • Other items to consider:
      i. Outliers can cause erroneous results. Cook's D is never negative; common rules of thumb flag observations with Cook's D above 1 or above 4/n, while a cutoff of about ±2 applies to standardized residuals.
      ii. The linear regression may not be the best fit (curvilinear, quadratic, etc.).
      iii. Large data sets can yield a statistically significant p-value even when the estimated slope is not meaningfully different from 0.
      iv. Averages of raw data (e.g. summing a region) can overstate the strength of the correlation, so be mindful of what you are trying to prove with your analysis.
5. Compute the appropriate statistics:
   • Pearson correlation coefficient (remember that correlation does not imply causation!)
   • F-Value
   • Prob > F
   • Did you observe any problematic outliers? What (if anything) can you do about them?
6. Decide whether to retain or reject your null hypothesis:
   • If $p > \alpha$, then retain the null hypothesis.
   • If $p < \alpha$, then reject the null hypothesis and accept the alternative hypothesis.
   • Remember that statistical significance does not imply practical or meaningful significance!
7. Interpret the parameters ($\beta_0$ and $\beta_1$):
   • What change in the expected response variable results from a one-unit increase in the predictor variable (that is, what is the slope of the regression line)? Is it positive or negative? Is it meaningful?
   • Is zero within your predictor variable (IV) value range? What does that mean?

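Steps 5 through 7 can be sketched on a small made-up sample. Everything in this example is an assumption for illustration: the ten performance scores and \$/hr values are invented, and because the Python standard library has no F distribution, the decision compares the F-Value with the tabulated critical value for (1, 8) degrees of freedom at α = 0.05 instead of computing Prob > F. Software such as SAS or Excel reports Prob > F directly.

```python
import math

# Hypothetical data: performance score (IV) and $/hr (DV) for ten employees.
performance = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rate_per_hr = [12.0, 12.5, 13.5, 13.0, 14.5, 15.0, 15.5, 17.0, 16.5, 18.0]

ALPHA = 0.05
# Tabulated critical value of F with (1, 8) degrees of freedom at alpha = 0.05.
# It is only valid for n = 10 observations and a single predictor.
F_CRITICAL = 5.32

n = len(performance)
mean_x = sum(performance) / n
mean_y = sum(rate_per_hr) / n
sxx = sum((x - mean_x) ** 2 for x in performance)
syy = sum((y - mean_y) ** 2 for y in rate_per_hr)
sxy = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(performance, rate_per_hr))

# Step 5: compute the statistics.
r = sxy / math.sqrt(sxx * syy)          # Pearson correlation coefficient
b1 = sxy / sxx                          # estimated slope
b0 = mean_y - b1 * mean_x               # estimated intercept
ss_regression = b1 * sxy                # variation explained by the line
ss_error = syy - ss_regression          # unexplained variation
f_value = ss_regression / (ss_error / (n - 2))

# Step 6: decide. F above the critical value is equivalent to Prob > F < alpha.
reject_null = f_value > F_CRITICAL

# Step 7: interpret the parameters.
print(f"r = {r:.3f}, F = {f_value:.1f}, reject H0 at alpha {ALPHA}: {reject_null}")
print(f"Each one-point increase in performance is associated with "
      f"${b1:.2f}/hr more pay.")
print(f"Intercept b0 = {b0:.2f}; a score of 0 is outside the observed range "
      f"{min(performance)}-{max(performance)}, so read it with caution.")
```

The last line illustrates the final question in step 7: when zero is not within the range of the predictor, the intercept is an extrapolation and usually has no practical meaning.
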
Please watch the videos for detailed instructions.
