Fox Module 10 R2 practice problems


Author: NEAS

(The attached PDF file has better formatting.)

** Exercise 10.1: R²

A simple linear regression with an intercept and one explanatory variable fit to 18 observations has a total sum of squares (TSS) = 256 and s² (the ordinary least squares estimator for σ²) = 4.


A. How many degrees of freedom does the regression equation have?

B. What is RSS, the residual sum of squares?

C. What is RegSS, the regression sum of squares?

D. What is the R² of the regression equation?

E. What is the adjusted (corrected) R² of the regression equation?

F. What is the correlation of the explanatory variable and the response variable?

G. What is the F-value for the omnibus F-test?

H. What is the t-value for the explanatory variable?

Part A: The regression equation has N – k – 1 = 18 – 1 – 1 = 16 degrees of freedom.

Take heed: In this equation, k is the number of explanatory variables not including the intercept α.

Part B: The estimate of the variance of the error term (s²) is the residual (error) sum of squares divided by the number of degrees of freedom, N – k – 1: s² = RSS / df, so the residual sum of squares (RSS) = s² × degrees of freedom = 4 × 16 = 64.

Part C: The regression sum of squares (RegSS) = TSS – RSS.


The total sum of squares TSS is the sum of the squared deviations of the response variable from its mean, given in the problem as 256.

RSS = s² × (N – 2) = 4 × 16 = 64.

RegSS = 256 – 64 = 192.


Part D: The R² = RegSS / TSS = 1 – RSS / TSS = 192 / 256 = 75%.

Part E: Adjusted R² = 1 – (1 – R²) × (N – 1) / (N – k – 1) = 1 – (1 – 75%) × 17 / 16 = 73.44%

Part F: The correlation ρ(x,y) = r = √R² = √75% = 0.866 in absolute value; the sign of r matches the sign of the slope coefficient, which the problem does not give.

Part G: Fox, Chapter 6, page 109: for the omnibus F-test in a simple linear regression, R₀² = 0 and k = 1, so

F = (N – 2) × R² / (1 – R²) = (18 – 2) × 0.75 / (1 – 0.75) = 48.000

Part H: The t-value for simple linear regression is the square root of the F-value: √48 = 6.928
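The arithmetic in Parts A through H can be checked with a short script. This is only a sketch: the variable names are ours, and k counts explanatory variables excluding the intercept, as in Part A.

```python
import math

n, k = 18, 1            # observations; explanatory variables excluding the intercept
tss, s2 = 256.0, 4.0    # total sum of squares and s² as given in the problem

df = n - k - 1                                  # Part A: degrees of freedom
rss = s2 * df                                   # Part B: residual sum of squares
reg_ss = tss - rss                              # Part C: regression sum of squares
r2 = reg_ss / tss                               # Part D: R²
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # Part E: adjusted R²
abs_r = math.sqrt(r2)                           # Part F: |correlation|
f_value = (n - k - 1) / k * r2 / (1 - r2)       # Part G: omnibus F
abs_t = math.sqrt(f_value)                      # Part H: |t| for the slope

print(df, rss, reg_ss, r2, round(adj_r2, 4), round(abs_r, 3), f_value, round(abs_t, 3))
```

The square roots give magnitudes only; the signs of r and t follow the sign of the slope.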


[This practice problem is an essay question, reviewing the meaning of the significance tests, goodness-of-fit tests, and measures of predictive power. It relates the statistical tests to the form of the regression line, emphasizing the intuition. Final exam problems test specific items in a multiple choice format.]

** Exercise 10.2: Measures of significance

The R², the adjusted (corrected) R², the s² (the ordinary least squares estimator for σ²), the t-value, and the F-value measure the significance, goodness-of-fit, or predictive power of the regression.


A. What does the R² measure?

B. What does the adjusted (corrected) R² measure?

C. When is it important to use the adjusted (corrected) R² instead of the simple R²?

D. If the R² ≈ 0, what can one say about the regression?

E. If the R² ≈ 1, what can one say about the regression?

F. What does the s² measure?

G. Given R², what is the F-value for the omnibus F-test?

H. What does the F-value measure?

I. If the F-value ≈ 0, what can one say about the regression?

Part A: R² measures the percentage of the total sum of squares explained by the regression, or RegSS / TSS.

Jacob: Why does the textbook show the R² as 1 – RSS / TSS? This is equivalent, since RSS + RegSS = TSS.

Rachel: To adjust for degrees of freedom (for the corrected R²), we adjust RSS and TSS. The format R² = 1 – RSS / TSS makes it easier to understand the adjustment for degrees of freedom.

Jacob: Does the R² measure if the regression analysis is significant? The textbook gives significance levels for t-values and F-values (and associated confidence intervals for the regression coefficients), but it does not give significance levels for R².

Rachel: R² combines two items: whether the explanatory variables have predictive power and whether the regression coefficients are significantly different from zero (or from another null hypothesis). This exercise reviews the concepts and explains what R² implies vs what s² and the F-value imply.

Part B: R² does not adjust for degrees of freedom. If the regression has N data points and uses N explanatory variables (or N – 1 independent variables + 1 intercept), all points are fit exactly, and the R² = 100%. This is true even if the explanatory variables have no predictive power: that is, each explanatory variable is independent of the response variable.

The same problem exists even if the number of explanatory variables is less than the number of data points. Even if the explanatory variables are independent of the response variable and have no predictive power, the sample R² is almost always more than zero.

The adjusted (corrected) R² adjusts for degrees of freedom. The degrees of freedom apply to RSS and TSS, not to RegSS. With N data points and k independent variables (= k + 1 explanatory variables including the intercept), the TSS has N – 1 degrees of freedom and the RSS has N – k – 1 degrees of freedom.

Fox explains: R² is 1 – RSS / TSS = the complement of (the residual sum of squares / total sum of squares). The adjusted R² is the complement of (the residual variance / the total variance).

The adjusted (corrected) R² = 1 – [RSS / (N – k – 1)] / [TSS / (N – 1)].

The R² is a ratio of sums of squares and the adjusted (corrected) R² is a ratio of variances.
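The two ways of writing the adjusted R² (as a ratio of variances, and in terms of R²) agree, which a small check makes concrete. This is a sketch with function names of our own choosing; k excludes the intercept, and the inputs reuse the figures from Exercise 10.1.

```python
def adjusted_r2_from_variances(rss, tss, n, k):
    """1 - (residual variance) / (total variance); k excludes the intercept."""
    residual_variance = rss / (n - k - 1)
    total_variance = tss / (n - 1)
    return 1 - residual_variance / total_variance


def adjusted_r2_from_r2(r2, n, k):
    """Same quantity written in terms of the unadjusted R²."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Figures from Exercise 10.1: RSS = 64, TSS = 256, N = 18, k = 1
rss, tss, n, k = 64.0, 256.0, 18, 1
from_variances = adjusted_r2_from_variances(rss, tss, n, k)
from_r2 = adjusted_r2_from_r2(1 - rss / tss, n, k)
print(from_variances, from_r2)
```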

Part C: For most regression analyses, the R² is fine. It says what percentage of the variation in the sample values is explained by the regression. This percentage is not used for tests of significance, so a slight over-statement is not a problem.

Jacob: Is the R² over-stated? The textbook does not say that it is over-stated.

Rachel: The R² says what percentage of the variation in the sample values is explained by the regression. It is the correct percentage, not over- or under-stated. Some of the explanation is spurious, caused by random fluctuations in small data samples. The adjusted R² says: What would the R² be if we had an infinite number of data points?

Jacob: This adjustment seems proper; why do we still use the simple R²?

Rachel: We have a sample data set; we don’t know what the R² would be if we had an infinite number of data points. We estimate the expected correction. This estimate is approximately unbiased, but it is sometimes too high and sometimes too low.

To compare regression equations with different degrees of freedom, one must use the adjusted R². For example, suppose one regresses a response variable Y on several explanatory variables. One might say that the best regression equation is the one which explains the largest percentage of the variation in the response variable. R² is not a valid measure, since adding an explanatory variable never decreases the R² (and almost always increases it), even if the explanatory variable is unrelated to the response variable. Instead, we choose the regression equation with the highest adjusted R².

Part D: If R² is close to zero, the explanatory variables explain almost none of the variance in the response variable. For a simple linear regression with one explanatory variable, the correlation of X and Y is close to zero.

Jacob: Suppose we draw a scatterplot of Y against X. If R² is close to zero, is the scatterplot a cloud of points with no pattern?

Rachel: The R² reflects two things: the variance of the error term and the slope of the regression line. The variance of the error term compared to the dispersion of the response variable determines whether the scatterplot is a cloud of points with no clear pattern or a set of points lying next to the regression line. The slope of the regression line (the β coefficient) determines whether the explanatory variable strongly affects the response variable.

The units of measurement are important. Suppose we regress personal auto claim frequency on the distance the car is driven.


If the slope coefficient is β when the distance is in miles (or kilometers), the slope coefficient is β × 1,000 when the distance is in thousands of miles (kilometers).

If the slope coefficient is β when the claim frequency is in claims per car, the slope coefficient is β × 100 when the claim frequency is in claims per hundred cars, since every Y value is 100 times as large.
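The effect of units on the slope can be seen by refitting the same data after rescaling. The mileage and claim-frequency figures below are hypothetical, invented only for illustration, and `ols_slope` is our own helper, not a library function. Measuring X in thousands of miles multiplies the slope by 1,000; measuring Y in claims per hundred cars multiplies every Y value, and therefore the slope, by 100.

```python
import math


def ols_slope(xs, ys):
    """Least squares slope for a simple regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data: annual miles driven and claims per car
miles = [4000.0, 8000.0, 12000.0, 16000.0, 20000.0]
claims_per_car = [0.03, 0.05, 0.06, 0.09, 0.10]

beta = ols_slope(miles, claims_per_car)
beta_thousand_miles = ols_slope([m / 1000 for m in miles], claims_per_car)
beta_per_100_cars = ols_slope(miles, [100 * c for c in claims_per_car])

print(math.isclose(beta_thousand_miles, beta * 1000, rel_tol=1e-9))
print(math.isclose(beta_per_100_cars, beta * 100, rel_tol=1e-9))
```

R², the t-value, and the F-value are unchanged by either rescaling; only the slope and its standard error change.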


Illustration: Suppose the regression line is Y = 1 + 0 × X + ε. N (number of points) = 1,000, the values of the explanatory variable are the integers from 1 to 1,000, and σ²ε = 1. The scatterplot is a horizontal line Y = 1 with slight random fluctuations above and below the line. The scatterplot shows a clear pattern; it is not a cloud of points. But R² is close to zero, since the values of X have no effect on the values of Y.

Now suppose the true regression line is Y = 1 + 1 × X + ε, with N (number of points) = 1,000, the values of the explanatory variable are the integers from 1 to 1,000, and σ²ε = 1 million. The scatterplot is a 45° diagonal line Y = 1 + X with large random fluctuations above and below the line. The scatterplot does not show a clear pattern; it appears as a cloud of points, and only by looking carefully does one see the pattern. But R² is clearly above zero (roughly 8% here, since the variance of the X values is about 83,000 against an error variance of 1 million), because the values of X have a real effect on the values of Y. The exact value of R² depends on the error terms.
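A rough figure for R² in each illustration comes from comparing the variance contributed by the regression line, β² × Var(X), with the error variance. The sketch below is an approximation that ignores sampling noise (the realized R² depends on the error terms); `approx_r2` is our own helper name.

```python
def approx_r2(beta, xs, sigma2):
    """Approximate R² as explained variance over total variance of Y."""
    n = len(xs)
    x_bar = sum(xs) / n
    var_x = sum((x - x_bar) ** 2 for x in xs) / n   # variance of the X values
    explained = beta ** 2 * var_x
    return explained / (explained + sigma2)


xs = range(1, 1001)                           # X = 1, 2, ..., 1,000
r2_flat = approx_r2(0.0, xs, 1.0)             # first illustration: slope 0, σ² = 1
r2_noisy = approx_r2(1.0, xs, 1_000_000.0)    # second illustration: slope 1, σ² = 1 million

print(r2_flat, round(r2_noisy, 4))
```

The first case gives zero despite the tight scatter around Y = 1; the second gives a small but nonzero value despite the noisy scatter.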

Some statisticians do not much use R², since it is a mix of two values: the slope of the regression line and the ratio of σε to the dispersion of the Y values. We do not use R² for goodness-of-fit tests or tests of significance, since it mixes two items. We use the t-value (or the F-value) for the significance of the explanatory variables.

Part E: If R² is close to 1, the correlation of the explanatory variable and the response variable (X and Y) is close to 1 or –1. Almost all the variation in the response variable is explained by the explanatory variables.

An R² close to 1 implies that the ratio of σε to the dispersion of the Y values (the standard deviation of Y) is low. Three things affect the R².


RSS and σ²ε are low.

β is not low.

TSS (which is proportional to the variance of Y) is high.


Part F: s² is the ordinary least squares estimator of σ²ε. Most importantly, s² is an unbiased estimator of σ²ε.

Jacob: Does this imply that s is an unbiased estimator of σε?

Rachel: If s² is an unbiased estimator of σ²ε, s is not an unbiased estimator of σε. To grasp the rationale for this, suppose σ²ε is 4 and s² is 2, 3, 4, 5, or 6, with 20% probability of each.


σε is √4 = 2.

s is √2, √3, √4, √5, or √6, with a 20% probability of each.

The mean of s is (√2 + √3 + √4 + √5 + √6) / 5 = 1.966, which is less than 2.

s is a reasonable estimator of σε, but it is not unbiased.
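Rachel's five-outcome example is easy to verify numerically. The sketch below uses only the values stated in the example; the variable names are ours.

```python
import math

s2_values = [2, 3, 4, 5, 6]   # equally likely values of s² in the example

mean_s2 = sum(s2_values) / len(s2_values)                       # average of s²
mean_s = sum(math.sqrt(v) for v in s2_values) / len(s2_values)  # average of s
sigma = math.sqrt(4)                                            # true σ when σ² = 4

print(mean_s2, round(mean_s, 3), sigma)
```

The mean of s² equals σ² = 4, yet the mean of s falls below σ = 2, because the square root is a concave function.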


Part G: Use the relation F = [ (N – k – 1) / q ] × R² / (1 – R²), where k is the number of explanatory variables (not including the intercept) and q is the number of variables in the group being tested.

Jacob: How is this relation derived?

Rachel: Use the expression for the F-value in terms of RSS and divide numerator and denominator by TSS.

Jacob: Fox has a q in his formula (page 109) and an R₀. What is the difference between k and q, and what is R₀?

Rachel: Fox shows the general form of the F-value. For the omnibus F-test, the null hypothesis is that all β’s are zero, so k = q and R₀² (the R² for the null hypothesis) = 0.

Jacob: Can you explain the intuition for that last statement?

Rachel: If all β’s are zero, RSS = TSS, and RegSS = 0.
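For the omnibus test (q = k, R₀² = 0), the F-value written with sums of squares and the F-value written with R² are the same number, since dividing RegSS and RSS by TSS gives R² and 1 – R². The sketch below checks this with the figures from Exercise 10.1; the function names are ours.

```python
def f_from_sums_of_squares(reg_ss, rss, n, k):
    """Omnibus F = (RegSS / k) / (RSS / (N - k - 1)); k excludes the intercept."""
    return (reg_ss / k) / (rss / (n - k - 1))


def f_from_r2(r2, n, k):
    """Same statistic after dividing numerator and denominator by TSS."""
    return (n - k - 1) / k * r2 / (1 - r2)


n, k = 18, 1
tss, rss = 256.0, 64.0
reg_ss = tss - rss

f_ss = f_from_sums_of_squares(reg_ss, rss, n, k)
f_r2 = f_from_r2(reg_ss / tss, n, k)
print(f_ss, f_r2)
```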

Part H: The F-value measures if a group of explanatory variables in combination is significant. The omnibus F-test measures if all the explanatory variables in combination are significant.

Jacob: Is that the same as at least one explanatory variable is significant? After all, if the explanatory variables in combination are significant, at least one of them must be significant.

Rachel: No, that is not correct. A clear example is a regression analysis on a group of correlated explanatory variables. Suppose an actuary regresses the loss cost trend for workers’ compensation on three inflation indices: monetary inflation (the change in the CPI), wage inflation, and medical inflation. All three inflation indices are highly correlated. If any one were used in the regression equation alone, it would significantly affect the loss cost trend. If all three are used, we may not be able to discern which affects the loss cost trend, and none might be significant.

Jacob: If the regression equation has only one explanatory variable, are the t-value and the F-value the same?

Rachel: They have the same p-values, and they are equivalent significance tests, but they have different units. The F-value is the square of the t-value.

Part I: If the F-value is close to zero, the slope coefficient is not significantly different from zero. This means one of three things:

1. The slope coefficient is close to zero. The slope coefficient β depends on the units of measurement, so the term close to zero depends on the units of measurement. To avoid problems with the units of measurement, assume the X and Y values are normalized: deviations from the mean in units of the standard deviation.

2. The variance of the error term σ²ε is large relative to the variance of the response variable. The random fluctuation in the error term overwhelms the effect of the explanatory variable.

3. The data sample has so few points that the regression pattern is spurious. For example, one can draw a straight line connecting any two points, so the regression analysis means nothing. The F-test has zero degrees of freedom in the denominator, and the result is not significant no matter how large it is.


[The following exercise explains some intuition for R², adjusted R², F-values, and significance.]

** Exercise 10.3: Measures of significance

Two regression equations Y and Z regress inflation rates on interest rates using data from different periods. The true population distributions of the explanatory variable and the response variable are the same in the two equations.


Equation Y has the higher R² and an estimated slope coefficient of βY.

Equation Z has the higher adjusted (corrected) R² and an estimated slope coefficient of βZ.



A. Which regression equation uses a larger data set?

B. Which regression equation has a greater F-value?

C. Which is the better estimate of the slope coefficient: βY or βZ?

Part A: Equation Y has the higher R² and the lower adjusted (corrected) R². This implies that Equation Y has fewer data points, and more of its R² is spurious.

Part B: The F-test uses the same adjustment for degrees of freedom as the adjusted R², so Equation Z has the higher F-value.

Part C: βZ has the higher t-value (the square root of the F-value), so it is the better estimate. In practice, we would use a weighted average of the two β’s, with more weight given to Equation Z.


** Exercise 10.4: R²

A simple (two-variable) linear regression model Yi = α + β × Xi + εi is fit to the 5 points:

(0, 0), (1, 1), (2, 4), (3, 4), (4, 6)


A. What is the mean X value?

B. What is the mean Y value?

C. What are the five points in deviation form?

D. What is ∑(xi)²?

E. What is ∑(yi)²?

F. What is ∑(xi)(yi)?

G. What is R²?

H. What is the adjusted (corrected) R²?

Part A: The mean X value (X̄) = (0 + 1 + 2 + 3 + 4) / 5 = 2

Part B: The mean Y value (Ȳ) = (0 + 1 + 4 + 4 + 6) / 5 = 3

Part C: For the deviations from the mean, subtract 2 from each X value and 3 from each Y value to get

(–2, –3), (–1, –2), (0, 1), (1, 1), (2, 3)

Part D: ∑(xi)² = 4 + 1 + 0 + 1 + 4 = 10

Part E: ∑(yi)² = 9 + 4 + 1 + 1 + 9 = 24

Part F: ∑(xi)(yi) = 6 + 2 + 0 + 1 + 6 = 15

Part G: The total sum of squares (TSS) = ∑(yi)² = 9 + 4 + 1 + 1 + 9 = 24

The regression sum of squares (RegSS) = [ ∑(xi)(yi) ]² / ∑(xi)² = 15² / 10 = 22.5

The R² = RegSS / TSS = 22.5 / 24 = 93.75%

Part H: Adjusted R² = 1 – (1 – R²) × (N – 1) / (N – k – 1) = 1 – (1 – 0.9375) × (5 – 1) / (5 – 1 – 1) = 0.917, where k = 1 is the number of explanatory variables not including the intercept.
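The same steps can be run on the five points directly. This is a sketch with our own variable names; k = 1 counts the single explanatory variable and excludes the intercept.

```python
points = [(0, 0), (1, 1), (2, 4), (3, 4), (4, 6)]
n, k = len(points), 1   # k excludes the intercept

x_bar = sum(x for x, _ in points) / n                       # Part A
y_bar = sum(y for _, y in points) / n                       # Part B
deviations = [(x - x_bar, y - y_bar) for x, y in points]    # Part C

sxx = sum(dx * dx for dx, _ in deviations)                  # Part D
syy = sum(dy * dy for _, dy in deviations)                  # Part E (= TSS)
sxy = sum(dx * dy for dx, dy in deviations)                 # Part F

reg_ss = sxy ** 2 / sxx                                     # regression sum of squares
r2 = reg_ss / syy                                           # Part G
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)               # Part H

print(x_bar, y_bar, sxx, syy, sxy, reg_ss, r2, round(adj_r2, 3))
```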


** Question 10.5: Adjusted R²

We fit the model Yi = α + β1 X1i + β2 X2i + β3 X3i + β4 X4i + εi to N observations.


Y = the expected value of R²

Z = the expected value of the adjusted R².


As N increases, which of the following is true?


A. Y increases and Z increases

B. Y increases and Z decreases

C. Y decreases and Z increases

D. Y decreases and Z decreases

E. Y decreases and Z stays the same

Answer 10.5: E

If N = 5, R² = 100%, since five coefficients (the intercept and four slopes) fit five points exactly, just as a straight line fits any two points. As N increases, the expected R² declines toward the squared multiple correlation between the response variable and the explanatory variables in the population.

The adjusted R² is corrected for degrees of freedom, so its expected value is approximately that population value, regardless of N.

Intuition: R² is correct for large samples and overstated for small samples.

The adjusted (corrected) R² is an approximately unbiased estimate for all sample sizes.
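A small simulation illustrates the intuition. To keep it short, the sketch below makes two simplifying assumptions that are not in the question: it uses one explanatory variable instead of four, and it draws X and Y independently, so the population R² is zero. Under those assumptions the average R² should sit near 1 / (N – 1) and shrink as N grows, while the average adjusted R² stays near zero. The helper names, seed, and trial count are our own choices.

```python
import random


def r2_simple(xs, ys):
    """R² of a simple regression = squared sample correlation of xs and ys."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


def mean_r2_and_adjusted(n, trials, rng):
    """Average R² and adjusted R² over many samples with unrelated X and Y."""
    total_r2 = total_adj = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]   # independent of xs
        r2 = r2_simple(xs, ys)
        total_r2 += r2
        total_adj += 1 - (1 - r2) * (n - 1) / (n - 2)   # k = 1
    return total_r2 / trials, total_adj / trials


rng = random.Random(0)   # fixed seed so the run is repeatable
small = mean_r2_and_adjusted(5, 4000, rng)
large = mean_r2_and_adjusted(50, 4000, rng)
print(small, large)
```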


** Question 10.6: Adjusted R²

We estimate two regression equations, S and T, with a different number of observations and a different number of independent variables in each regression equation.


R²s and R²t are the R² for equations S and T.

Ns and Nt are the number of observations for equations S and T.

Ks and Kt are the number of independent variables for equations S and T.


R²s = R²t. Under what conditions is the adjusted R² for equation S definitely greater than the adjusted R² for equation T?


A. Ns > Nt and Ks > Kt

B. Ns < Nt and Ks < Kt

C. Ns > Nt and Ks < Kt

D. Ns < Nt and Ks > Kt

E. In all scenarios, the adjusted R² for equation S may be more or less than the adjusted R² for equation T.

Answer 10.6: C

Use the formula for the adjusted R² in terms of R², N, and k. Intuitively, the difference between the R² and the adjusted R² decreases as the degrees of freedom increase.

Adjusted R² = 1 – (1 – R²) × (N – 1) / (N – k – 1), where k is the number of independent variables (not including the intercept).

N is more than k + 1. The value of (N – 1) / (N – k – 1)


decreases as N increases

increases as k increases


As (N – 1) / (N – k – 1) decreases, the adjusted R² increases. Choice C has these relations.
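The direction of each effect can be confirmed numerically. In this sketch k counts independent variables excluding the intercept, so the denominator is N – k – 1; the common R² of 0.60 and the sample sizes are hypothetical values picked to match Choice C.

```python
def penalty(n, k):
    """Degrees-of-freedom multiplier (N - 1) / (N - k - 1); k excludes the intercept."""
    return (n - 1) / (n - k - 1)


def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * penalty(n, k)


r2 = 0.60                              # hypothetical common R² for both equations
adj_s = adjusted_r2(r2, n=40, k=2)     # equation S: more observations, fewer variables
adj_t = adjusted_r2(r2, n=25, k=5)     # equation T: fewer observations, more variables

print(round(adj_s, 4), round(adj_t, 4))
print(penalty(41, 2) < penalty(40, 2))   # multiplier falls as N rises
print(penalty(40, 3) > penalty(40, 2))   # multiplier rises as k rises
```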


Edited 13 Years Ago by NEAS
lms0123
The solution for Exercise 10.1 Part D shows R^2 = RegSS/RSS, which appears to be misstated; I believe it should be RegSS/TSS, which is equivalent to the second expression 1-RSS/TSS.

[NEAS: Thank you for pointing out the typo; the file has been corrected and re-uploaded.]
Edited 13 Years Ago by NEAS