The data for this analysis come from our labor dataset, and the analysis we will perform is a linear regression. First, filter the dataset so that it includes both Position 1 and Position 2 employees.
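As a rough illustration of the filtering step, here is a minimal Python sketch. The records, the field names (`position`, `rate_per_hour`), and the helper `filter_positions` are all hypothetical stand-ins; the real labor dataset and the tool you use (SAS, Excel, or something else) will differ.

```python
# Hypothetical records; the real labor dataset's columns may be named differently.
employees = [
    {"id": 1, "position": 1, "rate_per_hour": 18.50},
    {"id": 2, "position": 2, "rate_per_hour": 22.00},
    {"id": 3, "position": 3, "rate_per_hour": 31.25},
    {"id": 4, "position": 1, "rate_per_hour": 17.75},
]


def filter_positions(rows, keep=(1, 2)):
    """Return only the rows whose 'position' value is listed in `keep`."""
    return [row for row in rows if row["position"] in keep]


# Keep Position 1 and Position 2 employees; everything else is dropped.
subset = filter_positions(employees)
print([row["id"] for row in subset])
```

The same idea applies in any tool: keep a row only when its position value is 1 or 2, then run the regression on the filtered rows.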

We have explored the dataset and developed a question: What factors influence $/hr of Position 1 and 2 employees? To answer this question, we will analyze the dataset and perform a linear regression (LR) using $/hr as the dependent variable (DV) and other employee variables as independent variables (IV). First, a brief refresher on linear regression. This is only a simplified review; for a better understanding, please refer to DAL 5 and review other statistical methods and linear regression materials as needed.

A linear regression is a way to model the statistical relationship between a response (or dependent variable) and one or more explanatory (or independent) variables. The linear relationship between the two variables may be represented by a straight line, often called the regression line. Simply put, we want to see if the IV value can be used to accurately predict the DV value. Often, we can visually see if there is a relationship between the IV and DV by creating a scatterplot based on the data. See the following figures for examples of a positive, neutral, and negative relationship/correlation.

In Figure 1 we can see that as the independent variable increases on the horizontal axis from left to right, the dependent variable tends to increase in value, although the increase is not consistent due to error or to other explanatory variables not used in our model. We would need to perform a linear regression to analyze the strength of the relationship and identify the parameters (the constant and the slope) of the regression line. Based on this scatterplot, we might suspect a strong positive relationship between the IV and DV. The regression also tests for the linearity of the relationship.

[Scatterplot; horizontal axis: Independent Variable]

*Figure 1. Positive relationship between IV and DV*

In Figure 2 we can see that as the independent variable increases on the horizontal axis from left to right, the value of the dependent variable seems to vary randomly above and below the regression line. The slope of the regression line might be close to zero. This would indicate that for any value of the independent variable, the value of the dependent variable is equal to the constant plus some random error value that we do not know. There may be other explanatory variables (IV) that have a stronger relationship with the response variable (DV).

[Scatterplot; horizontal axis: Independent Variable]

*Figure 2. Neutral relationship between IV and DV*

In Figure 3 we can see that as the independent variable increases on the horizontal axis from left to right, the dependent variable tends to decrease in value, although the decrease is not consistent due to error, or other explanatory variables not used in our model. We would need to perform a linear regression to analyze the strength of the relationship and identify the parameters (the constant and the slope) of the regression line. We might suspect a strong negative relationship based on our observation of this scatterplot between the IV and DV.

[Scatterplot; horizontal axis: Independent Variable]

*Figure 3. Negative relationship between IV and DV*
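The three patterns described in Figures 1 to 3 can also be summarized numerically with the Pearson correlation coefficient. The sketch below uses only the Python standard library. The three DV series are made-up values chosen to mimic the shapes of the figures, not data from the labor dataset, and `pearson_r` is an illustrative helper name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up illustration data (assumption): one IV and three possible DVs.
iv = [0, 5, 10, 15, 20, 25, 30]
dv_positive = [4, 15, 19, 33, 38, 52, 57]   # tends to rise with the IV
dv_neutral = [22, 18, 25, 17, 24, 19, 21]   # scatters around a constant
dv_negative = [58, 50, 41, 30, 22, 12, 5]   # tends to fall with the IV

for label, dv in [("positive", dv_positive),
                  ("neutral", dv_neutral),
                  ("negative", dv_negative)]:
    print(label, round(pearson_r(iv, dv), 3))
```

A coefficient near +1 or -1 matches the strong positive and negative patterns, while a value near 0 matches the neutral pattern. The scatterplot is still worth inspecting, because a single number can hide curvature or outliers.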

Again, a linear regression is a way to model a hypothesized **statistical** relationship between a predictor variable (IV) and a response variable (DV). What is the difference between a deterministic relationship and a statistical relationship?

In a deterministic relationship, an equation **exactly** describes the relationship between two or more variables. Examples are the relationship between Fahrenheit and Celsius (°F = 9/5 × °C + 32) and the relationship between circumference and diameter (Circumference = π × diameter). In a deterministic relationship, there are no error terms to consider, and we can create a simple linear equation to model the relationship as depicted in Figure 4. This will have an R-squared of 1.

Y value = constant + (slope × X value)

*Figure 4. Deterministic linear equation*
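To make the "R-squared of 1" claim concrete, the following sketch computes R-squared for the Fahrenheit/Celsius equation. It assumes the usual definition, R² = 1 − SS_residual / SS_total. The noisy readings are invented for contrast, and `r_squared` is an illustrative helper name.

```python
def r_squared(observed, predicted):
    """R-squared as 1 - (residual sum of squares / total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


celsius = [-40, 0, 25, 37, 100]
predicted_f = [9 / 5 * c + 32 for c in celsius]   # deterministic equation

# Exact Fahrenheit equivalents: the equation leaves no error to explain.
exact_f = [-40, 32, 77, 98.6, 212]
r2_exact = r_squared(exact_f, predicted_f)

# Invented "thermometer readings" with measurement noise (assumption).
noisy_f = [-38.5, 30.1, 79.4, 97.0, 214.2]
r2_noisy = r_squared(noisy_f, predicted_f)

print(r2_exact, r2_noisy)
```

With exact values the residuals vanish (up to floating-point rounding), so R-squared is 1. Once noise enters, R-squared drops below 1, which is the situation a statistical relationship describes.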

In a statistical linear relationship, there is a trend (positive, neutral, or negative), plus a constant, plus some error that we see as the “scatter” in a scatterplot. So, we must modify our linear equation to find the line that best “describes” the relationship between the predictor variable and the response variable. Data analysis software like SAS and Excel does this by adjusting the position and the slope of the line until the sum of all the squared errors (differences between predicted and observed responses) has been minimized.

\(\hat{y}_i\) = the predicted response (or fitted value)

\(b_0\) = the estimated Y-axis intercept of the best fitting line

\(b_1\) = the estimated slope of the best fitting line

\(x_i\) = the predictor variable value (IV value)

\(y_i\) = the observed response value (DV value)

\(\beta_0\) = population regression line constant (estimated by \(b_0\))

\(\beta_1\) = population regression line slope (estimated by \(b_1\))

\(\varepsilon_i\) = error term; its sample estimate is the residual \(y_i - \hat{y}_i\), the difference between the observed and predicted response

\[\hat{y}_i = b_0 + b_1 x_i\]

\[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\]

*Figure 5. Statistical linear equation*
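The least-squares line can be sketched in a few lines of Python. This uses the standard closed-form solution for simple linear regression (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x), which gives the same line as minimizing the sum of squared errors. It is not a claim about how SAS or Excel implement the fit internally. The data and the names `fit_line` and `sum_squared_errors` are hypothetical.

```python
def fit_line(xs, ys):
    """Return (b0, b1): least-squares intercept and slope for y = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def sum_squared_errors(xs, ys, b0, b1):
    """Sum of squared differences between observed and predicted responses."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: an IV (e.g. years of service) and a DV ($/hr).
years = [1, 2, 3, 4, 5]
rate = [15.0, 17.5, 18.0, 21.5, 23.0]

b0, b1 = fit_line(years, rate)
best_sse = sum_squared_errors(years, rate, b0, b1)
print(b0, b1, best_sse)

# Nudging either parameter away from the fitted values increases the error.
print(sum_squared_errors(years, rate, b0 + 0.1, b1) > best_sse)
print(sum_squared_errors(years, rate, b0, b1 + 0.1) > best_sse)
```

The last two lines illustrate the "best fitting" property: any other intercept or slope produces a larger sum of squared errors on the same data.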

So, as we can see in the equations in Figure 5, the statistical linear relationship **approximately** describes the relationship between the predictor value and the response value, instead of the exact relationship described in a deterministic linear equation. Thus, we need to determine whether \(\beta_1\) is not equal to zero (\(\beta_1 \neq 0\)).



In testing the null hypothesis for a simple linear regression, we should generally follow these steps:

1. **State the plain language research question:** e.g. What factors influence $/hr for Position 1 and 2 employees?
2. **State the hypotheses:**
   - Null hypothesis: \(H_0: \beta_{\text{Performance}} = 0\)
   - Alternative hypothesis: \(H_A: \beta_{\text{Performance}} \neq 0\)
3. **State the criteria for rejecting \(H_0\):**
   - α = 0.05
4. **Consider the assumptions for linear regression:**
   - There is a linear relationship between the response variable and the predictor variable. (Use scatterplots of each continuous independent variable against the dependent variable.)
   - The errors, \(\varepsilon_i\), are independent (research design).
     - i. Is your data a “snapshot” or a “video” of your observations? If your data is more of a “video”, consider a time series analysis.
     - ii. Non-significant Chi Square
   - The errors, \(\varepsilon_i\), at each value of the predictor, \(x_i\), are normally distributed (not skewed, with a mean of zero). A non-significant Shapiro-Wilk statistic indicates a normal distribution of the error terms. (Examine the residual plots.)
   - The errors, \(\varepsilon_i\), at each value of the predictor, \(x_i\), have equal variances (\(\sigma^2\)).
     - i. No triangular-looking patterns between the response variable and the standardized residuals, and
     - ii. Non-significant Chi Square
   - Other items to consider:
     - i. Outliers can cause erroneous results (check Cook's D and standardized residuals beyond ±2).
     - ii. The linear regression may not be the best fit (curvilinear, quadratic, etc.).
     - iii. Large data sets can produce a statistically significant p-value even when the slope is not meaningfully different from 0.
     - iv. Averages of raw data (e.g. summing a region) can overstate the strength of the correlation, so be mindful of what you are trying to prove with your analysis.
5. **Compute the appropriate statistics:**
   - Pearson correlation coefficient (remember that correlation does not imply causation!)
   - F-Value
   - Prob > F
   - Did you observe any problematic outliers? What (if anything) can you do about them?
6. **Decide whether to retain or reject your null hypothesis:**
   - If p > α, then retain the null hypothesis.
   - If p < α, then reject the null hypothesis and accept the alternative hypothesis.
   - Remember that statistical significance does not imply practical or meaningful significance!
7. **Interpret the parameters (\(\beta_0\) and \(\beta_1\)):**
   - What change in the expected response variable results from a one-unit increase in the predictor variable (what is the slope of the regression line)? Is it positive or negative? Is it meaningful?
   - Is zero within your predictor variable (IV) value range? What does that mean?
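Steps 5 and 6 can be sketched in Python for a single predictor. This is a minimal illustration under stated assumptions, not the SAS or Excel procedure. The performance scores and pay rates are invented, and the function names are hypothetical. The p-value comes from numerically integrating Student's t density, which is adequate for small samples (very large samples would overflow `math.gamma`), and the data must not fit the line perfectly. For simple linear regression the F-value equals the squared t statistic of the slope, so "Prob > F" is the two-sided p-value computed here.

```python
from math import gamma, pi, sqrt


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value using Simpson's rule on the t density over [0, |t|]."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3                   # P(0 < T < |t|)
    return max(0.0, 1 - 2 * area)


def slope_test(xs, ys, alpha=0.05):
    """Fit y = b0 + b1*x and test H0: beta1 = 0 against HA: beta1 != 0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    df = n - 2                             # residual degrees of freedom
    se_b1 = sqrt(sse / df / sxx)           # standard error of the slope
    t_stat = b1 / se_b1
    p = two_sided_p(t_stat, df)
    return {"b0": b0, "b1": b1, "F": t_stat ** 2, "p": p, "reject_H0": p < alpha}


# Invented example: performance score (IV) and $/hr (DV).
performance = [1, 2, 3, 4, 5, 6, 7, 8]
rate = [15.2, 16.1, 15.8, 17.9, 18.4, 18.1, 20.3, 20.9]
result = slope_test(performance, rate)
print(result)
```

If `reject_H0` is true, the slope `b1` is then interpreted as in step 7: the expected change in $/hr for a one-unit increase in the predictor. Whether that change is practically meaningful is a separate judgment from statistical significance.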

Please watch the videos for detailed instructions.
