Regression analysis Module 22 Poisson GLM practice problems
** Exercise 22.1: Poisson GLM
An actuary uses a Poisson GLM with a log-link function to relate claim frequency to sex (male vs female) and location (urban vs rural), using dummy regressors of
sex: 0 = female and 1 = male
location: 0 = rural and 1 = urban
The estimates for the coefficients are
β1 = sex (male) = 0.2
β2 = location (urban) = 0.3
If female-rural drivers have an expected claim frequency of 5%, what are the expected claim frequencies for
A. male-rural drivers
B. female-urban drivers
C. male-urban drivers
Part A:
The expected claim frequency for any driver is exp(α + β1 × sex + β2 × location).
The expected claim frequency for female-rural drivers is exp(α + 0.2 × 0 + 0.3 × 0) = exp(α).
The expected claim frequency for male-rural drivers is exp(α + 0.2 × 1 + 0.3 × 0) = exp(α + 0.2) = exp(α) × e^0.2 = exp(α) × 1.22140 = 5% × 1.22140 = 6.11%.
Part B:
The expected claim frequency for female-urban drivers is exp(α + 0.2 × 0 + 0.3 × 1) = exp(α + 0.3) = exp(α) × e^0.3 = exp(α) × 1.34986 = 5% × 1.34986 = 6.75%.
Part C:
The expected claim frequency for male-urban drivers is exp(α + 0.2 × 1 + 0.3 × 1) = exp(α + 0.2 + 0.3) = exp(α) × e^0.2 × e^0.3 = exp(α) × 1.22140 × 1.34986 = 5% × 1.22140 × 1.34986 = 8.24%.
Jacob:
What is the value of α?
Rachel:
e^α = 5%
α = ln(5%) = -2.99573.
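The arithmetic in Parts A to C can be checked with a short script. This is a minimal sketch, not part of the original exercise: the function and variable names are illustrative, and the only inputs are the coefficients and the 5% base frequency given above.

```python
import math

# Values taken from the exercise statement.
BETA_SEX = 0.2         # dummy: 0 = female, 1 = male
BETA_LOCATION = 0.3    # dummy: 0 = rural, 1 = urban
BASE_FREQUENCY = 0.05  # expected claim frequency for female-rural drivers

# With both dummies at zero, exp(alpha) is the base frequency.
alpha = math.log(BASE_FREQUENCY)


def expected_frequency(sex, location):
    """Invert the log link: exp(alpha + beta1 * sex + beta2 * location)."""
    return math.exp(alpha + BETA_SEX * sex + BETA_LOCATION * location)


frequencies = {
    "female-rural": expected_frequency(0, 0),
    "male-rural": expected_frequency(1, 0),
    "female-urban": expected_frequency(0, 1),
    "male-urban": expected_frequency(1, 1),
}

for label, freq in frequencies.items():
    print(f"{label}: {freq:.2%}")
```

Because the link is logarithmic, each dummy acts as a multiplicative relativity (e^0.2 for male, e^0.3 for urban) applied to the base frequency.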
** Exercise 22.2: Poisson GLM
An actuary uses a Poisson GLM with a log-link function to relate claim frequency to sex (male vs female) and location (urban vs rural), using a dummy regressor for each.
The maximized likelihoods using neither sex nor location, sex only, location only, or both sex and location are
H0: neither sex nor location: 0.0001
H1: sex only: 0.0003
H2: location only: 0.0019
H3: both sex and location: 0.0021
H4: saturated model: 0.0028
The actuary uses a 5% significance level to test hypotheses, and χ²(0.05, 1), the critical value of the χ² distribution with 1 degree of freedom at a 5% significance level, is 3.84.
A. What are the residual deviances for models H0, H1, H2, H3, and H4?
B. Is sex plus location a significant improvement on sex alone?
C. Is sex plus location a significant improvement on location alone?
D. What is the (pseudo-) R² for the models using sex only, location only, and sex plus location?
E. Would the answers to Parts B and C change if the saturated model has a greater maximized likelihood?
Part A:
The residual deviance is 2 × ( ln(Ls) – ln(Lm) ), where
Lm = the maximized likelihood under the model in question.
Ls = the maximized likelihood under a saturated model.
The residual deviances for the five models here are
H0: neither sex nor location: 2 × (ln(0.0028) – ln(0.0001) ) = 6.664
H1: sex only: 2 × (ln(0.0028) – ln(0.0003) ) = 4.467
H2: location only: 2 × (ln(0.0028) – ln(0.0019) ) = 0.776
H3: both sex and location: 2 × (ln(0.0028) – ln(0.0021) ) = 0.575
H4: saturated model: 2 × (ln(0.0028) – ln(0.0028) ) = 0.000
The residual deviance for the saturated model is always zero.
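The residual deviances can be reproduced from the maximized likelihoods in a few lines. This is a sketch for checking the arithmetic; the dictionary layout and function name are illustrative choices, and the likelihood values are the ones given in the exercise.

```python
import math

# Maximized likelihoods from the exercise statement.
likelihoods = {
    "H0 (neither)": 0.0001,
    "H1 (sex only)": 0.0003,
    "H2 (location only)": 0.0019,
    "H3 (sex and location)": 0.0021,
    "H4 (saturated)": 0.0028,
}
L_SATURATED = likelihoods["H4 (saturated)"]


def residual_deviance(l_model, l_saturated=L_SATURATED):
    """Residual deviance = 2 * (ln(Ls) - ln(Lm))."""
    return 2.0 * (math.log(l_saturated) - math.log(l_model))


deviances = {name: residual_deviance(lik) for name, lik in likelihoods.items()}

for name, dev in deviances.items():
    print(f"{name}: {dev:.3f}")
```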
Part B:
The difference in the residual deviances for two models (G₀²), one of which is nested within the other, has a χ² distribution with degrees of freedom equal to the number of extra parameters in the larger model. For sex only vs sex plus location, G₀² = the difference in the residual deviances = 4.467 – 0.575 = 3.892, which is more than χ²(0.05, 1), the critical value of the χ² distribution with 1 degree of freedom at a 5% significance level, which is 3.84. We reject the smaller model (sex alone) in favor of the larger model (sex plus location).
Jacob:
What does this significance test mean?
Rachel:
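The critical χ² value of 3.84 means that, if the smaller model (sex alone) is true, the probability of a difference in residual deviances of 3.84 or more by chance alone is 5%. Since the observed difference in residual deviances is more than 3.84, the probability of seeing a difference this large if the smaller model (sex alone) is true is less than 5%, so we reject the smaller model. (The test does not give the probability that the smaller model is true.)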
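The 5% figure can be checked numerically. The sketch below assumes only the standard identity that a χ² variable with 1 degree of freedom is the square of a standard normal variable, so its upper-tail probability is erfc(sqrt(x / 2)); the function name is illustrative, and the formula does not apply to other degrees of freedom.

```python
import math


def chi2_upper_tail_1df(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom.

    Uses X = Z**2 with Z standard normal, so
    P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))


p_at_critical = chi2_upper_tail_1df(3.84)   # critical value from the exercise
p_observed = chi2_upper_tail_1df(3.892)     # sex only vs sex plus location

print(f"tail probability at 3.84:  {p_at_critical:.4f}")
print(f"tail probability at 3.892: {p_observed:.4f}")
```

The tail probability at the observed statistic falls just under 5%, which is why the smaller model is rejected at this significance level.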
Part C:
For location only vs sex plus location, the difference in the residual deviances is 0.776 – 0.575 = 0.201, less than χ²(0.05, 1), the critical value of the χ² distribution with 1 degree of freedom at a 5% significance level, which is 3.84. We do not reject the smaller model (location alone) in favor of the larger model (sex plus location).
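Both nested-model comparisons can be run directly from the maximized likelihoods. This is a minimal sketch with illustrative names; it uses the fact that the difference in residual deviances equals 2 × (ln(L_larger) – ln(L_smaller)), and it takes the critical value 3.84 from the exercise rather than computing it.

```python
import math

CHI2_CRITICAL_1DF = 3.84  # 5% critical value with 1 degree of freedom, from the exercise

# Maximized likelihoods from the exercise statement.
L_SEX = 0.0003
L_LOCATION = 0.0019
L_BOTH = 0.0021


def deviance_difference(l_smaller, l_larger):
    """Difference in residual deviances for a smaller model nested in a larger one."""
    return 2.0 * (math.log(l_larger) - math.log(l_smaller))


g2_sex_vs_both = deviance_difference(L_SEX, L_BOTH)
g2_location_vs_both = deviance_difference(L_LOCATION, L_BOTH)

for label, g2 in [("sex only vs both", g2_sex_vs_both),
                  ("location only vs both", g2_location_vs_both)]:
    verdict = "reject smaller model" if g2 > CHI2_CRITICAL_1DF else "do not reject smaller model"
    print(f"{label}: G2 = {g2:.3f} -> {verdict}")
```

Working from unrounded likelihoods gives about 0.200 for the second comparison rather than 0.201; the small gap comes from rounding the individual deviances to three decimals before subtracting, and the conclusion is unchanged.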
Jacob:
Can we use this significance test to compare H1 (sex only) with H2 (location only)?
Rachel:
We use this significance test to compare nested models, not models with unrelated explanatory variables. H1 (sex only) is not nested in H2 (location only) and H2 is not nested in H1, so we can’t compare them. In this example, if H1 had a residual deviance of 0.776, just like H2, both explanatory variables may be significant; comparing the residual deviances of these two models doesn’t tell us anything.
Jacob:
Why do GLMs use the χ² distribution whereas classical regression analysis uses the F distribution?
Rachel:
Both GLMs and classical regression analysis use both the χ² distribution and the F distribution. Use the χ² distribution if the variance is known; use the F distribution if the variance must be estimated from the data. For the binomial and Poisson conditional distributions of the response variable, the variance is known if the mean is known:
Binomial distribution: σ² = μ × (1 – μ)
Poisson distribution: σ² = μ
For the normal distribution assumed by classical regression analysis, the variance is unrelated to the mean.
If one uses a pseudo-Poisson or pseudo-binomial distribution for the GLM, and the dispersion parameter must be estimated from the data, one uses an F distribution, not a χ² distribution.
Part D:
We define D0 as the residual deviance for the model including only the regression constant α (termed the null deviance) and D1 as the residual deviance for the model in question. The (pseudo-) R² = 1 – D1 / D0 represents the proportion of the null deviance accounted for by the model.
For the model using sex only, the R² is 1 – 4.467 / 6.664 = 32.97%.
For the model using location only, the R² is 1 – 0.776 / 6.664 = 88.36%.
For the model using sex plus location, the R² is 1 – 0.575 / 6.664 = 91.37%.
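The three pseudo-R² values follow from the null deviance and each model's residual deviance. The sketch below recomputes them from the maximized likelihoods given in the exercise; the names are illustrative.

```python
import math

# Maximized likelihoods from the exercise statement.
L_NULL = 0.0001       # H0: constant only
L_SATURATED = 0.0028  # H4
models = {
    "sex only": 0.0003,
    "location only": 0.0019,
    "sex plus location": 0.0021,
}


def residual_deviance(l_model):
    """Residual deviance = 2 * (ln(Ls) - ln(Lm))."""
    return 2.0 * (math.log(L_SATURATED) - math.log(l_model))


D0 = residual_deviance(L_NULL)  # null deviance

# Pseudo-R2 = 1 - D1 / D0 for each model.
pseudo_r2 = {name: 1.0 - residual_deviance(lik) / D0 for name, lik in models.items()}

for name, r2 in pseudo_r2.items():
    print(f"{name}: {r2:.2%}")
```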
Jacob:
Why is this called the pseudo-R² instead of the R²?
Rachel:
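In classical regression with one explanatory variable, the R² is the square of the correlation between the response variable and the explanatory variable (with several explanatory variables, it is the square of the correlation between the response variable and the fitted values). The pseudo-R² for a GLM is not the square of a correlation. The two measures are analogous: the R² for classical regression analysis is the percentage of the total sum of squares accounted for by the regression equation, and the pseudo-R² for the GLM is the percentage of the null deviance accounted for by the GLM.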
Part E:
The maximized likelihood for the saturated model does not affect the G² used to compare nested models. If the maximized likelihoods for two models W and Z (with W nested in Z) are Lw and Lz, and the maximized likelihood for the saturated model is Ls, then
G²(w,z) = 2 × ( ln(Ls) – ln(Lw) ) – 2 × ( ln(Ls) – ln(Lz) ) = 2 × ( ln(Lz) – ln(Lw) )
The ln(Ls) terms cancel, so the answers to Parts B and C do not change.
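The cancellation can be seen numerically by recomputing the test statistic under different saturated likelihoods. In this sketch the value 0.0028 comes from the exercise, while 0.0050 and 0.0100 are hypothetical values chosen only to illustrate the point; the function names are illustrative as well.

```python
import math

# Maximized likelihoods from the exercise statement.
L_SEX = 0.0003   # smaller model (sex only)
L_BOTH = 0.0021  # larger model (sex plus location)


def residual_deviance(l_model, l_saturated):
    """Residual deviance = 2 * (ln(Ls) - ln(Lm))."""
    return 2.0 * (math.log(l_saturated) - math.log(l_model))


def g2(l_smaller, l_larger, l_saturated):
    """Difference in residual deviances between two nested models."""
    return residual_deviance(l_smaller, l_saturated) - residual_deviance(l_larger, l_saturated)


# 0.0028 is from the exercise; 0.0050 and 0.0100 are hypothetical alternatives.
results = {l_sat: g2(L_SEX, L_BOTH, l_sat) for l_sat in (0.0028, 0.0050, 0.0100)}

for l_sat, stat in results.items():
    print(f"saturated likelihood {l_sat}: G2 = {stat:.3f}")
```

Each residual deviance grows as the saturated likelihood grows, but the difference between two of them stays the same.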