This candidate regresses a team’s winning percentage on five explanatory variables. He starts with all five variables and eliminates them one at a time, each time dropping the variable with the highest p-value and checking the effect on the adjusted R². He concludes that only two of the five explanatory variables are good predictors of a team’s winning percentage.
This student project is analogous to actuarial pricing, and the baseball statistics provide a good data set to practice the statistical techniques. Pricing actuaries may start with a dozen explanatory variables for a driver’s loss frequency or a policyholder’s mortality rate. We use categorical (qualitative) explanatory variables, and we now use generalized linear models (GLMs) instead of classical regression analysis, but the concepts are the same. We eliminate one explanatory variable at a time to identify the variables that best predict future loss costs or mortality.
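The elimination procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the candidate's actual work: the data are synthetic, the variable names (batting_avg, runs_allowed, noise_1 to noise_3) are hypothetical, and the helper functions invert, ols, and backward_eliminate are written here for the example. Within a single fit every coefficient has the same degrees of freedom, so the variable with the highest two-sided p-value is the one with the smallest absolute t-statistic; the sketch uses that fact to avoid needing a t-distribution table.

```python
import random

def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        piv = m[c][c]
        m[c] = [v / piv for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]

def ols(rows, y):
    """Regress y on the predictors in rows (intercept added); return t-statistics and RSS."""
    X = [[1.0] + list(r) for r in rows]
    n, k = len(X), len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, x))) ** 2 for x, yi in zip(X, y))
    s2 = rss / (n - k)  # estimated residual variance
    t = [beta[i] / (s2 * inv[i][i]) ** 0.5 for i in range(k)]
    return t, rss

def backward_eliminate(data, y, names, min_vars=1):
    """Drop the weakest variable one at a time; record adjusted R-squared at each step."""
    n = len(y)
    mean_y = sum(y) / n
    tss = sum((v - mean_y) ** 2 for v in y)
    active = list(range(len(names)))
    path = []
    while True:
        t, rss = ols([[r[j] for j in active] for r in data], y)
        k = len(active) + 1  # parameters, including the intercept
        adj_r2 = 1 - (rss / (n - k)) / (tss / (n - 1))
        path.append(([names[j] for j in active], adj_r2))
        if len(active) <= min_vars:
            return path
        # Smallest |t| (skipping the intercept at t[0]) means highest p-value.
        weakest = min(range(len(active)), key=lambda i: abs(t[i + 1]))
        active.pop(weakest)

# Synthetic data (assumption): only the first two variables affect the response.
rng = random.Random(0)
names = ["batting_avg", "runs_allowed", "noise_1", "noise_2", "noise_3"]
data = [[rng.gauss(0, 1) for _ in names] for _ in range(200)]
win_pct = [0.5 + 0.05 * r[0] - 0.04 * r[1] + rng.gauss(0, 0.02) for r in data]

path = backward_eliminate(data, win_pct, names)
for kept, adj_r2 in path:
    print(len(kept), kept, round(adj_r2, 4))
```

Reading the printed path from top to bottom shows how the adjusted R² responds as each variable is dropped. A sharp fall after a removal suggests that the variable just removed was a useful predictor and that elimination should have stopped one step earlier.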
For your own student project, consider several ways to improve on the analysis here:
After reducing the full regression equation with five explanatory variables to a reduced regression equation with two or three explanatory variables, use an F test to check whether the excluded variables, taken together, significantly improve the fit.
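Consider the multicollinearity among the explanatory variables. Batting averages and strikeouts are correlated, and correlated explanatory variables inflate the standard errors of their coefficients, which makes p-values a less reliable guide for eliminating variables. You may find it better to use uncorrelated explanatory variables, such as batting averages and a measure of pitching performance (for example, runs given up or strikeouts by the pitcher).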
Compare the final regression equations for two time periods, such as 1910 to 1960 vs. 1961 to 2000. You don’t have to use such large periods; you can use a sample of years in each period. After fitting the regression equation for each time period, use an F test to see whether the differences between the two equations are more than random fluctuation.
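Both F tests suggested above compare the residual sum of squares (RSS) of a restricted model with that of a less restricted one, so one helper can serve both. The sketch below is a hedged illustration on synthetic data. The functions rss, partial_f, and make_period are hypothetical helpers written for this example, and the period comparison uses the standard Chow-test form, in which the pooled regression is the restricted model and the two separate regressions together form the full model.

```python
import random

def rss(rows, y):
    """Residual sum of squares from regressing y on rows (intercept added)."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    a = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        piv = a[c][c]
        a[c] = [v / piv for v in a[c]]
        for r in range(k):
            if r != c:
                f = a[r][c]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[c])]
    beta = [row[k] for row in a]
    return sum((yi - sum(b * v for b, v in zip(beta, x))) ** 2 for x, yi in zip(X, y))

def partial_f(rss_restricted, rss_full, q, df_full):
    """F statistic for q restrictions; df_full = observations minus full-model parameters."""
    return ((rss_restricted - rss_full) / q) / (rss_full / df_full)

rng = random.Random(1)

def make_period(n, slope):
    """Synthetic period (assumption): only the first of three variables matters."""
    rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
    y = [0.5 + slope * r[0] + rng.gauss(0, 0.02) for r in rows]
    return rows, y

early_x, early_y = make_period(60, 0.05)
late_x, late_y = make_period(60, 0.01)   # a different slope in the later period

# (a) Do the two excluded variables jointly improve the early-period fit?
rss_full = rss(early_x, early_y)                         # intercept + 3 slopes
rss_reduced = rss([r[:1] for r in early_x], early_y)     # intercept + 1 slope
f_excluded = partial_f(rss_reduced, rss_full, q=2, df_full=60 - 4)

# (b) Chow test: does the reduced equation differ between the two periods?
k = 2                                                    # parameters per equation
rss_pooled = rss([r[:1] for r in early_x + late_x], early_y + late_y)
rss_separate = rss_reduced + rss([r[:1] for r in late_x], late_y)
f_chow = partial_f(rss_pooled, rss_separate, q=k, df_full=120 - 2 * k)

print("F for excluded variables:", round(f_excluded, 3))
print("F for period difference: ", round(f_chow, 3))
```

Compare each statistic with the critical value from an F table, using q numerator degrees of freedom and df_full denominator degrees of freedom. A statistic above the critical value in (a) suggests the excluded variables belong in the equation; a statistic above it in (b) suggests the two periods follow different regression equations.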
If you want to find other sports data, post a question on the discussion forum, such as "Where can I find batting averages for players on a particular team or for teams in a league?" Many sports web sites have this information, and other candidates may quickly direct you to a suitable data source.
If you have done a GLM analysis for your company, you can adapt that analysis for your student project. Explain the hypotheses, the techniques, the statistical tests, and the results. If you have not learned GLM analysis, use classical regression analysis and sports data.