Generalized Linear Regression (GLM)

 

 

· The Basic
· Linear model
· Logistic model
· Stepwise selection of independent variables

 

 

The Basic

 

The generalized linear model (GLM) generalizes linear regression in two ways: the linear predictor is related to the response variable via a link function, and the variance of each measurement is allowed to be a function of its predicted value.

 

E(Y) = g⁻¹(β0 + β1*X1 + ... + βm*Xm), where g is the link function

 

In a GLM, each outcome of the dependent variable, Y, is assumed to be generated from a particular distribution in the exponential family.

 

The generalized linear model differs from the general linear model in two major respects: (1) the distribution of the dependent (response) variable can be non-normal and does not have to be continuous; (2) the dependent variable values are predicted from a linear combination of predictor variables, which is connected to the dependent variable via a link function.

In the general linear model the dependent variable values are expected to follow the normal distribution, and the link function is a simple identity function (i.e., the linear combination of values for the predictor variables is not transformed).  

The GLM consists of three elements:

1. A probability distribution from the exponential family, such as the normal, binomial, Poisson or gamma distribution.

2. A linear predictor η = Xβ .

3. A link function g such that E(Y) = μ = g⁻¹(η).

 

Distribution and Link Function

The following table lists commonly used distributions and their link functions.

Distribution            Link name         Link function           Mean function
Normal                  Identity          Xβ = μ                  μ = Xβ
Exponential, Gamma      Inverse           Xβ = μ^(-1)             μ = (Xβ)^(-1)
Inverse Gaussian        Inverse squared   Xβ = μ^(-2)             μ = (Xβ)^(-1/2)
Poisson                 Log               Xβ = ln(μ)              μ = exp(Xβ)
Binomial, Multinomial   Logit             Xβ = ln(μ / (1 - μ))    μ = exp(Xβ) / (1 + exp(Xβ)) = 1 / (1 + exp(-Xβ))
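As a minimal sketch of how a link function and its mean (inverse link) function pair up, the table can be written out in Python. Python is used here only for illustration; the dictionary LINKS, its keys and the helper mean_from_predictor are hypothetical names and are not Empower settings.

```python
import math

# Each entry maps a link name to the pair (link g, inverse link g^-1).
# The names are illustrative only; they are not Empower options.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "inverse_squared": (lambda mu: mu ** -2, lambda eta: eta ** -0.5),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
}


def mean_from_predictor(link_name, eta):
    """Apply the mean function: mu = g^-1(eta)."""
    _, inverse_link = LINKS[link_name]
    return inverse_link(eta)


# Round trip: applying g and then g^-1 should give back the mean.
mu = 0.25
for name, (link, inverse_link) in LINKS.items():
    eta = link(mu)
    print(name, round(eta, 4), round(inverse_link(eta), 4))
```

The round trip only makes sense for a mean that is valid for the distribution in question, for example a value strictly between 0 and 1 for the logit link.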

 

Linear model

 

Linear model here refers to the case in which the dependent variable (Y) follows a normal distribution and the link function is the identity.  The formula is:

 

Y = β0 + β1*X1 + β2*X2 + ... + βk*Xk

 

For example, we want to model systolic blood pressure (SBP) with age, sex, height, weight, smoking status (SMOKE), occupation (OCCU) and education (EDU).  The model could be written as:

 

SBP = β0 + β1*Age + β2*Sex + β3*Height + β4*Weight + β5*SMOKE + β6*OCCU + β7*EDU

 

After you specify the dependent variable (Y), Empower will automatically check the variable type. If it is a continuous variable, Empower will use the normal distribution and the identity link function by default.

 

Below is the sample input window

 

 

 

Below is the sample output of the above model:

 

Genelineralized Linear Models

 

Call:
glm(formula = SBP ~ AGE + SEX + HEIGHT + WEIGHT + SMOKE.NEW +
    OCCU + factor(EDU.NEW), family = gaussian(link = "identity"),
    data = WD, na.action = na.omit)

Deviance Residuals:
     Min        1Q    Median        3Q       Max 
-14.3760   -3.9627    0.0851    3.8310   15.6338 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)   
(Intercept)      130.61781    7.31844  17.848  < 2e-16 ***
AGE                0.71385    0.02234  31.948  < 2e-16 ***
SEX               -4.11534    0.76099  -5.408 9.00e-08 ***
HEIGHT           -25.94509    4.62962  -5.604 3.10e-08 ***
WEIGHT             0.35255    0.03717   9.485  < 2e-16 ***
SMOKE.NEW         -1.65419    0.60788  -2.721  0.00668 **
OCCU               0.39372    0.42813   0.920  0.35811   
factor(EDU.NEW)2   0.01309    0.55348   0.024  0.98114   
factor(EDU.NEW)3  -0.51641    0.67281  -0.768  0.44304   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 27.91044)

    Null deviance: 57127  on 652  degrees of freedom
Residual deviance: 17974  on 644  degrees of freedom
AIC: 4037.9

Number of Fisher Scoring iterations: 2

 

Explanation of the output:

 

In the above output, the “Coefficients” section lists the regression coefficients (the βs) and their significance tests.

 

For example:

 

The β for AGE is 0.71385, which means that each 1-year increase in age is associated with an increase of 0.71385 mmHg in SBP, with the other variables in the model held constant.  The significance test for β asks whether β is significantly different from 0. If it is not, the data give no evidence that SBP changes with age.  In this example, the p value for AGE is < 2e-16 (0.0000000000000002), which is very small: if H0 (SBP does not change with age, β = 0) were true, the probability of obtaining a sample whose estimated β is at least as far from 0 as 0.71385 would be less than 2e-16.

 

The β for SMOKE (β5 in the model) is -1.65419, which means that smokers (SMOKE.NEW=1) have an SBP 1.65419 mmHg lower than non-smokers (SMOKE.NEW=0). The p value is 0.00668, so this β is significantly different from 0.
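The columns of the coefficient table are linked by simple arithmetic, which the following Python sketch reproduces for the AGE row. The numbers are copied from the sample output; the 95% interval uses the normal quantile 1.96 as an assumption, and the sketch is not a description of how Empower computes its results internally.

```python
# Values copied from the sample output above (AGE row and deviance lines).
estimate = 0.71385
std_error = 0.02234
residual_deviance = 17974
residual_df = 644

# The t value is the estimate divided by its standard error.
t_value = estimate / std_error

# Approximate 95% confidence interval for the AGE coefficient.
# Assumption: the normal quantile 1.96 stands in for the t quantile,
# which is nearly identical with 644 residual degrees of freedom.
ci_low = estimate - 1.96 * std_error
ci_high = estimate + 1.96 * std_error

# For the gaussian family the dispersion estimate is the residual
# deviance divided by the residual degrees of freedom.
dispersion = residual_deviance / residual_df

print(f"t value: {t_value:.2f}")
print(f"95% CI for AGE: {ci_low:.3f} to {ci_high:.3f}")
print(f"dispersion: {dispersion:.2f}")
```

Small differences from the printed t value and dispersion are expected, because the inputs used here are the rounded figures shown in the output.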

 

Besides the model output, Empower will also output the following plots:

(1)   QQ plot of the residuals, which is used to examine whether the residuals follow a normal distribution. Theoretically, linear regression does not require the dependent variable (Y) to follow a normal distribution, but it does require the residuals to follow one.

(2)   Scatter plot of residuals versus fitted values, which is used to examine whether the variance of the residuals is independent of the fitted value. Plotting residuals against fitted values should produce points scattered randomly about 0, regardless of the size of the fitted value.

(3)   Term plot for each independent variable, which shows the fitted line and partial residuals for that variable. These plots allow us to study the marginal relationship between each independent variable and the dependent variable, given that the other variables are in the model.

 

 

 

 

 

Logistic model

 

Logistic model here refers to the case in which the dependent variable (Y) follows a binomial distribution and the link function is the logit, log(p/(1-p)).  The formula is:

 

Logit(Y) = β0 + β1*X1 + β2*X2 + ... + βk*Xk

 

For example, we want to model the disease hypertension (HBP, 0=no 1=yes) with age, height, weight and smoking status (SMOKE).  The model could be written as:

 

Logit(HBP) = β0 + β1*Age + β2*Height + β3*Weight + β4*SMOKE

 

After you specify the dependent variable (Y), Empower will automatically check the variable type. If it is a dichotomous variable, Empower will use the binomial distribution and the logit link function by default.

 

Below is the sample input window

 

 

In the above screen shot, we limited the population to males whose education level is “primary school” (see the “Select population” box: SEX=1 AND EDU=1).

 

Below is the sample output of the above model:

 

Genelineralized Linear Models
                                      
 Using subset of data: SEX=1 AND EDU=1
 
Call:
glm(formula = HBP ~ AGE + HEIGHT + WEIGHT + SMOKE.NEW, family = binomial(link = "logit"), 
    data = WD, na.action = na.omit)
 
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1713  -0.3976  -0.1179   0.2905   2.2252  
 
Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) -47.92649   20.38250  -2.351  0.01871 * 
AGE           0.33504    0.10340   3.240  0.00119 **
HEIGHT       18.57273   11.54518   1.609  0.10768   
WEIGHT        0.02858    0.09031   0.316  0.75163   
SMOKE.NEW    -2.07438    1.86563  -1.112  0.26618   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
 
(Dispersion parameter for binomial family taken to be 1)
 
    Null deviance: 62.480  on 51  degrees of freedom
Residual deviance: 31.429  on 47  degrees of freedom
AIC: 41.429
 
Number of Fisher Scoring iterations: 7

 

              Odds ratio    Low 95%CI   High 95%CI     P value
(Intercept) 1.533869e-21 6.852579e-39 3.433383e-04 0.018705176
AGE         1.397994e+00 1.141538e+00 1.712066e+00 0.001194272
HEIGHT      1.164216e+08 1.732113e-02 7.825124e+17 0.107682063
WEIGHT      1.028996e+00 8.620605e-01 1.228258e+00 0.751628233
SMOKE.NEW   1.256339e-01 3.243805e-03 4.865855e+00 0.266182487

 

 

Explanation of the output

 

In the above output, the “Coefficients” section lists the regression coefficients (the βs) and their significance tests.

 

For example:

 

The β for AGE is 0.33504, which means that each 1-year increase in age increases logit(HBP) by 0.33504.  Logit(HBP) is log(p/(1-p)), where p is the probability of hypertension (HBP) and p/(1-p) is called the odds. A 1-year increase in age therefore multiplies the odds by e^β = exp(0.33504) = 1.40; that is, the odds ratio of hypertension for each 1-year increase in age is 1.40.
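The relation between the logit, the probability and the odds can be sketched in Python using the coefficient estimates from the sample output. The example person is hypothetical, and the height of 1.65 assumes that HEIGHT was recorded in meters, which the output does not state.

```python
import math

# Coefficient estimates copied from the sample output above.
COEFS = {
    "(Intercept)": -47.92649,
    "AGE": 0.33504,
    "HEIGHT": 18.57273,
    "WEIGHT": 0.02858,
    "SMOKE.NEW": -2.07438,
}


def predicted_probability(person):
    """Inverse logit of the linear predictor for one person."""
    eta = COEFS["(Intercept)"] + sum(
        COEFS[name] * value for name, value in person.items()
    )
    return 1.0 / (1.0 + math.exp(-eta))


def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)


# Hypothetical person; HEIGHT in meters is an assumption about the data.
person = {"AGE": 50, "HEIGHT": 1.65, "WEIGHT": 60, "SMOKE.NEW": 0}
one_year_older = {**person, "AGE": 51}

p_now = predicted_probability(person)
p_older = predicted_probability(one_year_older)
age_odds_ratio = odds(p_older) / odds(p_now)

print(f"probability at age 50: {p_now:.3f}")
print(f"probability at age 51: {p_older:.3f}")
print(f"odds ratio per year of age: {age_odds_ratio:.2f}")
```

Whatever values are chosen for the other variables, the ratio of the two odds equals exp(0.33504), which is why a single odds ratio can be reported per variable.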

 

The significance test for β asks whether β is significantly different from 0. If it is not, the odds ratio is not significantly different from 1, and the data give no evidence that the odds of hypertension change with age.  In this example, the p value for AGE is 0.00119: if H0 (age is not associated with hypertension, β = 0) were true, the probability of obtaining a sample with an odds ratio at least as far from 1 as 1.40 would be 0.00119. We therefore reject H0 and conclude that the odds of hypertension increase with age.

 

The β for SMOKE (β4 in the model) is -2.07438, which gives an odds ratio of 0.13 (e^(-2.07438)). This means that, comparing smokers (SMOKE.NEW=1) with non-smokers (SMOKE.NEW=0), the odds ratio of hypertension is 0.13.  The p value is 0.26618, so β is not significantly different from 0, and the data do not show that smokers have lower odds of hypertension.
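The odds ratio table in the sample output can be reproduced, to rounding, from the estimates and standard errors in the “Coefficients” section. The Python sketch below assumes a Wald-type interval, exp(β ± 1.96 × SE); the helper name odds_ratio_with_ci is hypothetical, and the agreement with the printed figures suggests, but does not confirm, that Empower uses the same calculation.

```python
import math


def odds_ratio_with_ci(beta, std_error, z=1.96):
    """Return (odds ratio, lower limit, upper limit) for one coefficient.

    Assumes a Wald interval on the logit scale, exponentiated afterwards.
    """
    return (
        math.exp(beta),
        math.exp(beta - z * std_error),
        math.exp(beta + z * std_error),
    )


# Estimates and standard errors copied from the sample output above.
age_or, age_low, age_high = odds_ratio_with_ci(0.33504, 0.10340)
smoke_or, smoke_low, smoke_high = odds_ratio_with_ci(-2.07438, 1.86563)

print(f"AGE:       {age_or:.3f} ({age_low:.3f} to {age_high:.3f})")
print(f"SMOKE.NEW: {smoke_or:.3f} ({smoke_low:.4f} to {smoke_high:.3f})")
```

The interval for SMOKE.NEW contains 1, which matches its non-significant p value, while the interval for AGE lies entirely above 1.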

 

Besides the model output, Empower will also output the following plots:

(1)    QQ plot of the residuals, which compares the residuals against a normal distribution. Large departures from the reference line suggest that the model may not fit well, although for a binary outcome the residuals are not expected to be exactly normal.

(2)    Scatter plot of residuals versus fitted values, which is used to examine whether the variance of the residuals is independent of the fitted value. Plotting residuals against fitted values should produce points scattered randomly about 0, regardless of the size of the fitted value.

(3)   Term plot for each independent variable, which shows the fitted line and partial residuals for each term (independent variable). These plots allow us to study the marginal relationship between each independent variable and the dependent variable, given that the other variables are in the model.

 

 

 

 

 

Stepwise selection of independent variables (Empower(R) only)

 

After you select a list of independent variables (Xs), you can instruct Empower to put all the independent variables in the model, or use a selection method to select variables.

 

The selection methods are: “Stepwise selection: forward/backward”, “Stepwise selection: backward only”, “Stepwise selection: forward only” and “Exhaustive search best subsets”.

 

If you select “Exhaustive search best subset”, Empower will search all possible subsets, print the best-fitting models of each size (number of independent variables in the model) and then report the best-fitting model overall, that is, the one with the maximum adjusted R².

 

“Maximum size of subsets to examine” defines the maximum number of independent variables that may be included in the search (default = 8). “Number of subset of each size” sets, for each model size, the number of best-fitting subsets to print (default = 3).  You can change a default by clicking the number.

 

For example, suppose you listed 10 Xs (independent variables). The method first searches models of size 1, which contain a single X; there are 10 possible models of size 1, and only the 3 best-fitting ones are printed. It then searches models of size 2 (two Xs in the model); there are 45 (9+8+…+1) possible models of size 2, and again only the 3 best-fitting ones are printed. The search continues in this way up to model size 8.
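The number of candidate models of each size is a binomial coefficient, which gives a feel for how quickly an exhaustive search grows. A minimal Python sketch for the example above (10 candidate variables, maximum size 8); the variable names are illustrative only:

```python
from math import comb

n_candidates = 10  # number of Xs listed
max_size = 8       # "Maximum size of subsets to examine"

# Number of distinct models that contain exactly `size` of the Xs.
models_per_size = {
    size: comb(n_candidates, size) for size in range(1, max_size + 1)
}
total_models = sum(models_per_size.values())

for size, count in models_per_size.items():
    print(f"size {size}: {count} possible models")
print(f"total examined: {total_models}")
```

Sizes 1 and 2 give the 10 and 45 models mentioned in the text; only the best-fitting few of each size are printed.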

 

“Variables forced in” lists the variables that you always want to be included in the model.

 

A screen shot of sample input window:

 

 

 

In the above screen shot, we limited the population to males whose education level is “primary school” (see the “Select population” box: SEX=1 AND EDU=1).

 

Below is the sample output of the above model and explanation:

 
Genelineralized Linear Models
                                          
(1)     Empower first lists the model and subset condition of data used:
 
 Using subset of data: SEX=1 AND EDU.NEW=1
Subset selection object
Call: regsubsets.formula(SBP ~ AGE + HEIGHT + WEIGHT + SMOKE.NEW + 
    PSMK + OCCU, data = WD, nbest = 3, nvmax = 8, intercept = TRUE, 
    method = c("exhaustive"))
6 Variables  (and intercept)
 
(2)     Then it lists the variables forced in. In this example, no variables were forced in.
 
          Forced in Forced out
AGE           FALSE      FALSE
HEIGHT        FALSE      FALSE
WEIGHT        FALSE      FALSE
SMOKE.NEW     FALSE      FALSE
PSMK          FALSE      FALSE
OCCU          FALSE      FALSE
3 subsets of each size up to 6
 
(3)     Then it lists the models. The first column is the model size (number of independent variables). Columns 3 to 8 show whether each variable was in the model, one column per variable. For example, the last model includes all the variables.
 
Selection Algorithm: exhaustive
  (Intercept)   AGE HEIGHT WEIGHT SMOKE.NEW  PSMK  OCCU
1        TRUE  TRUE  FALSE  FALSE     FALSE FALSE FALSE
1        TRUE FALSE  FALSE  FALSE     FALSE FALSE  TRUE
1        TRUE FALSE  FALSE  FALSE     FALSE  TRUE FALSE
2        TRUE  TRUE  FALSE   TRUE     FALSE FALSE FALSE
2        TRUE  TRUE   TRUE  FALSE     FALSE FALSE FALSE
2        TRUE  TRUE  FALSE  FALSE      TRUE FALSE FALSE
3        TRUE  TRUE  FALSE   TRUE      TRUE FALSE FALSE
3        TRUE  TRUE  FALSE   TRUE     FALSE FALSE  TRUE
3        TRUE  TRUE   TRUE   TRUE     FALSE FALSE FALSE
4        TRUE  TRUE  FALSE   TRUE      TRUE FALSE  TRUE
4        TRUE  TRUE   TRUE   TRUE      TRUE FALSE FALSE
4        TRUE  TRUE  FALSE   TRUE      TRUE  TRUE FALSE
5        TRUE  TRUE   TRUE   TRUE      TRUE FALSE  TRUE
5        TRUE  TRUE  FALSE   TRUE      TRUE  TRUE  TRUE
5        TRUE  TRUE   TRUE   TRUE      TRUE  TRUE FALSE
6        TRUE  TRUE   TRUE   TRUE      TRUE  TRUE  TRUE
 
(4)     Next, Empower lists the regression coefficient (β) of each variable in each model.  For example, model 4 contains AGE and WEIGHT.  The β for AGE is 0.7812490 and the β for WEIGHT is 0.3534581.
 
[[1]]
(Intercept)         AGE 
100.8668098   0.7310581 
 
[[2]]
(Intercept)        OCCU 
 130.666667    5.657658 
 
[[3]]
(Intercept)        PSMK 
    133.125       4.075 
 
[[4]]
(Intercept)         AGE      WEIGHT 
 78.7746545   0.7812490   0.3534581 
 
[[5]]
(Intercept)         AGE      HEIGHT 
 62.5399832   0.7413932  23.1571814 
 
[[6]]
(Intercept)         AGE   SMOKE.NEW 
103.4007422   0.7391932  -3.2199478 
 
[[7]]
(Intercept)         AGE      WEIGHT   SMOKE.NEW 
 81.3813561   0.7843681   0.3381378  -2.0956137 
 
[[8]]
(Intercept)         AGE      WEIGHT        OCCU 
 78.0999803   0.7975870   0.3641228  -0.9525526 
 
[[9]]
(Intercept)         AGE      HEIGHT      WEIGHT 
 83.6523120   0.7822019  -3.6315460   0.3715823 
 
[[10]]
(Intercept)         AGE      WEIGHT   SMOKE.NEW        OCCU 
 80.7710076   0.8029080   0.3492548  -2.2145532  -1.0706149 
 
[[11]]
(Intercept)         AGE      HEIGHT      WEIGHT   SMOKE.NEW 
 85.5411525   0.7851607  -3.1119559   0.3537863  -2.0795501 
 
[[12]]
(Intercept)         AGE      WEIGHT   SMOKE.NEW        PSMK 
81.35467608  0.78413073  0.33864605 -2.09271608  0.01719729 
 
[[13]]
(Intercept)         AGE      HEIGHT      WEIGHT   SMOKE.NEW        OCCU 
 85.3711737   0.8039693  -3.4459190   0.3666927  -2.1979434  -1.0812163 
 
[[14]]
(Intercept)         AGE      WEIGHT   SMOKE.NEW        PSMK        OCCU 
 80.3486561   0.8003185   0.3572758  -2.1777051   0.2530742  -1.1227655 
 
[[15]]
(Intercept)         AGE      HEIGHT      WEIGHT   SMOKE.NEW        PSMK 
85.95620814  0.78629428 -3.33183937  0.35258436 -2.09157123 -0.07808025 
 
[[16]]
(Intercept)         AGE      HEIGHT      WEIGHT   SMOKE.NEW        PSMK 
 84.4849712   0.8021327  -2.9892799   0.3696350  -2.1760119   0.1657438 
       OCCU 
 -1.1139660 
 
(5)     Next, Empower lists the adjusted R² for each model. For example, model 4 has adjusted R² = 0.71533943.  Adjusted R² is a version of R² that has been adjusted for the number of predictors in the model.

Adjusted R² = 1 - (1 - R²) * (N - 1) / (N - k - 1)

(N is the sample size, k is the number of independent variables)
                  
 Adjusted R-Square
 [1] 0.66051049 0.06113200 0.02853568 0.71533943 0.67193291 0.66485058
 [7] 0.71416297 0.71145582 0.70970394 0.71070634 0.70830223 0.70808199
[13] 0.70469383 0.70455798 0.70197384 0.69818817
 
(6)     Next, Empower lists the model with the maximum adjusted R². In this example, model 4 has the highest adjusted R².
 
 Model with maximum adjusted R-square
 
Call:
glm(formula = tmp.fmidx, family = gaussian(link = "identity"), 
    data = WD, na.action = na.omit)
 
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-9.3738  -3.3896   0.5106   2.8416   9.6112  
 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 78.77465    7.47960   10.53 3.50e-14 ***
AGE          0.78125    0.06862   11.39 2.28e-15 ***
WEIGHT       0.35346    0.10841    3.26  0.00203 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
 
(Dispersion parameter for gaussian family taken to be 23.97331)
 
    Null deviance: 4295.1  on 51  degrees of freedom
Residual deviance: 1174.7  on 49  degrees of freedom
AIC: 317.68
 
Number of Fisher Scoring iterations: 2
 
Response variable: SBP 
Total response variance: 84.2172 
Analysis based on 52 observations 
 
2 Regressors: 
AGE WEIGHT 
Proportion of variance explained by model: 72.65%
Metrics are normalized to sum to 100% (rela=TRUE). 
 
Relative importance metrics: 
 
             lmg       last       first
AGE    0.9571469 0.92421145 0.995627001
WEIGHT 0.0428531 0.07578855 0.004372999
 
Average coefficients for different model sizes: 
 
               1X       2Xs
AGE    0.73105814 0.7812490
WEIGHT 0.07654654 0.3534581
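The adjusted R² reported for model 4 can be checked against the deviances in the output above. For a gaussian model with identity link the deviance is the residual sum of squares, so R² = 1 - residual deviance / null deviance. A minimal Python sketch, using the standard adjusted R² formula with N observations and k independent variables; the function name adjusted_r2 is a hypothetical helper:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Values copied from the output for the selected model (AGE and WEIGHT).
null_deviance = 4295.1
residual_deviance = 1174.7
n_observations = 52
n_regressors = 2

r2 = 1.0 - residual_deviance / null_deviance
adj_r2 = adjusted_r2(r2, n_observations, n_regressors)

print(f"R-squared: {r2:.4f}")
print(f"adjusted R-squared: {adj_r2:.4f}")
```

The first figure corresponds to the “Proportion of variance explained by model” line and the second to the entry for model 4 in the adjusted R-square list, up to rounding of the printed deviances.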

 

 

(7)   Empower also plots the relative importance of each independent variable.

 

The relative importance can be calculated by 3 different methods. Method “first” uses the R² of each variable when it is the only independent variable in the model, so it completely ignores the other independent variables and no adjustment takes place.  Method “last” is the increase in R² when the variable is added last, after all the other independent variables.  Method “lmg” is the average over the average contributions in models of different sizes. This method requires far more computational effort. It decomposes R² into non-negative contributions that automatically sum to the total R², which is an advantage over the simple methods (“first” and “last”).  For details, refer to this paper: “Relative Importance for Linear Regression in R: The Package relaimpo”.
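To make the “lmg” idea concrete, the following Python sketch averages each variable's sequential contribution to R² over every order in which the variables can enter the model. It takes the R² of every subset as given; the variable names A and B and the R² values are hypothetical, chosen only for illustration, and the function is a simplified sketch rather than the relaimpo implementation.

```python
from itertools import permutations


def lmg_shares(r2_by_subset, variables):
    """Average sequential R-squared contribution over all entry orders."""
    shares = {name: 0.0 for name in variables}
    orders = list(permutations(variables))
    for order in orders:
        included = frozenset()
        for name in order:
            enlarged = included | {name}
            # Gain in R-squared when `name` joins the variables so far.
            shares[name] += r2_by_subset[enlarged] - r2_by_subset[included]
            included = enlarged
    return {name: total / len(orders) for name, total in shares.items()}


# Hypothetical R-squared values for every subset of two variables.
r2_by_subset = {
    frozenset(): 0.0,
    frozenset({"A"}): 0.60,
    frozenset({"B"}): 0.20,
    frozenset({"A", "B"}): 0.70,
}

lmg = lmg_shares(r2_by_subset, ["A", "B"])
total_r2 = r2_by_subset[frozenset({"A", "B"})]
# Normalizing so that the shares sum to 100% (as with rela=TRUE).
lmg_normalized = {name: share / total_r2 for name, share in lmg.items()}

print(lmg)
print(lmg_normalized)
```

In this example “first” would give 0.60 and 0.20, and “last” would give 0.50 and 0.10; neither pair sums to the total R² of 0.70, whereas the lmg shares do.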