Robust Regression

Robust regression can be used in any situation in which you would use ordinary linear regression.  When running regression diagnostics, you might discover that one or more data points are moderately outlying, yet you have no compelling reason to exclude them from the analysis.  Robust regression is a compromise between deleting these points and letting them violate the assumptions of ordinary regression.

Some terms:

Residual:  The difference between the actual, observed value and the predicted value (based on the regression equation).

Outlier:  In linear regression, an outlier is an observation with a large residual.

Leverage:  An observation with an extreme value on a predictor variable is a point with high leverage.  Leverage measures how far an observation's value on an independent variable lies from that variable's mean.  High-leverage points can have a large effect on the estimates of the regression coefficients.

Influence:  An observation is said to be influential if removing it substantially changes the estimates of the coefficients.  Influence can be thought of as the product of leverage and outlierness.
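The terms above can be made concrete with a small numerical sketch. The Python below is illustrative only and is not the module's code: the function name `ols_diagnostics` and the data are made up, the model is a one-predictor regression, and the residual is taken as observed minus fitted. It uses the usual textbook formulas, with leverage h_i = 1/n + (x_i - mean(x))^2 / Sxx and Cook's distance D_i = (r_i^2 / p) * h_i / (1 - h_i), where r_i is the standardized residual and p the number of coefficients.

```python
def ols_diagnostics(x, y):
    """Fit y = a + b*x by ordinary least squares and return per-point diagnostics."""
    n = len(x)
    p = 2  # number of coefficients: intercept and slope
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean

    # Residual: observed value minus fitted value
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual variance estimate on n - p degrees of freedom
    s2 = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage: grows with the squared distance of x_i from the mean of x
    leverage = [1.0 / n + (xi - x_mean) ** 2 / sxx for xi in x]
    # Standardized residual: residual scaled by its estimated standard deviation
    std_resid = [e / (s2 * (1 - h)) ** 0.5 for e, h in zip(residuals, leverage)]
    # Cook's distance: combines outlierness (r^2) with leverage (h / (1 - h))
    cooks_d = [(r ** 2 / p) * h / (1 - h) for r, h in zip(std_resid, leverage)]
    return residuals, leverage, std_resid, cooks_d


# Made-up data: the last point has an extreme x value (high leverage)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
residuals, leverage, std_resid, cooks_d = ols_diagnostics(x, y)
for i in range(len(x)):
    print(i, round(residuals[i], 3), round(leverage[i], 3), round(cooks_d[i], 3))
```

In this made-up data set the last point does not have the largest raw residual, yet its Cook's distance is far larger than any other because its leverage is close to 1.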

Robust regression deals with cases that have very high leverage, and cases that are outliers. 

Robust regression is essentially a form of weighted least squares regression, carried out iteratively. At each step, a new set of weights is determined from the residuals: in general, the larger the residual, the smaller the weight. The weights therefore depend on the residuals, while the residuals depend on the model and the model depends on the weights. This circular dependence is resolved by iterating until the change in the parameter estimates is small enough.
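The iteration described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the module's implementation: it handles a single predictor, uses the Huber weight function with tuning constant k = 1.345, and rescales residuals by the median absolute residual divided by 0.6745. The function names `weighted_fit` and `huber_irls` and the data are made up for the example.

```python
from statistics import median


def weighted_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns (intercept, slope)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def huber_irls(x, y, k=1.345, tol=1e-8, max_iter=100):
    """Iteratively re-weighted least squares with Huber weights (sketch)."""
    w = [1.0] * len(x)  # equal weights: the first fit is ordinary least squares
    intercept, slope = weighted_fit(x, y, w)
    for _ in range(max_iter):
        resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
        # Robust scale estimate from the median absolute residual
        scale = median([abs(e) for e in resid]) / 0.6745
        if scale == 0:
            break  # at least half the points are fit exactly; nothing to rescale
        # Huber weights: 1 for small residuals, shrinking as |residual| grows
        w = [1.0 if abs(e) <= k * scale else k * scale / abs(e) for e in resid]
        new_intercept, new_slope = weighted_fit(x, y, w)
        change = max(abs(new_intercept - intercept), abs(new_slope - slope))
        intercept, slope = new_intercept, new_slope
        if change < tol:
            break  # parameter estimates have stopped moving
    return intercept, slope, w


# Made-up data: roughly y = 1 + 2x, with the last observation far off the line
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.9, 17.2, 18.9, 40.0]

ols_intercept, ols_slope = weighted_fit(x, y, [1.0] * len(x))
rob_intercept, rob_slope, weights = huber_irls(x, y)
print("OLS slope:", round(ols_slope, 3))
print("Robust slope:", round(rob_slope, 3))
print("Weights:", [round(wi, 3) for wi in weights])
```

The outlying observation is never deleted; it simply ends up with a much smaller weight than the rest, so the robust slope stays closer to the trend of the remaining points than the ordinary least squares slope does.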

A screenshot of a sample input window (image: ScreenHunter_03 Jan. 03 23.19.gif)

A sample output:

The module gives the robust regression output first, followed by the ordinary least squares regression for comparison.  If a subject ID variable is given, it also lists the 10 observations with the smallest weights; these are the most likely outliers.  The module also plots Cook’s distance, residuals vs. fitted values, a residual Q-Q plot, and a scale-location plot.  For Cook’s distance, see Cook, R. D. and Weisberg, S. (1982), Residuals and Influence in Regression. London: Chapman and Hall.

Plots: Cook’s distance (demo_7_rbreg_CooksD.png), residuals vs. fitted (demo_7_rbreg_residual.png), scale-location (demo_7_rbreg_ScaleLoc.png), residual Q-Q (demo_7_rbreg_QQ.png)

 

Robust Regression Using R

                                                       

Robust Regression (iterated re-weighted least squares)

Call: rlm(formula = SBP ~ AGE + BMI + factor(EDU.NEW) + SMOKE.NEW +
    PSMK + ALH + OCCU + SEX, data = WD)

Residuals:
     Min       1Q   Median       3Q      Max
-44.3657  -9.8958  -0.9605   9.8967 103.8504

Coefficients:
                 Value   Std. Error t value
(Intercept)      97.9449  7.2383    13.5315
AGE               0.6625  0.0513    12.9202
BMI               0.6145  0.2646     2.3222
factor(EDU.NEW)2  0.2750  1.6384     0.1679
factor(EDU.NEW)3  1.1896  1.9666     0.6049
SMOKE.NEW        -3.1688  1.6825    -1.8834
PSMK              0.2829  1.3219     0.2140
ALH              -3.1881  1.8586    -1.7154
OCCU              4.4803  1.2434     3.6031
SEX              -6.0733  1.8658    -3.2550

Residual standard error: 14.68 on 774 degrees of freedom

                                

Ordinary Least Square Regression

Call:
lm(formula = SBP ~ AGE + BMI + factor(EDU.NEW) + SMOKE.NEW +
    PSMK + ALH + OCCU + SEX, data = WD)

Residuals:
    Min      1Q  Median      3Q     Max
-49.302 -12.152  -2.344   8.843  95.640

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       90.7816     8.5542  10.612  < 2e-16 ***
AGE                0.8119     0.0606  13.399  < 2e-16 ***
BMI                0.7814     0.3127   2.499  0.01268 *
factor(EDU.NEW)2   0.3358     1.9362   0.173  0.86235
factor(EDU.NEW)3   0.3854     2.3242   0.166  0.86836
SMOKE.NEW         -3.6554     1.9884  -1.838  0.06639 .
PSMK              -0.3858     1.5622  -0.247  0.80501
ALH               -2.0423     2.1965  -0.930  0.35277
OCCU               4.4871     1.4695   3.053  0.00234 **
SEX               -5.6099     2.2050  -2.544  0.01115 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.74 on 774 degrees of freedom
Multiple R-squared: 0.2417,        Adjusted R-squared: 0.2329
F-statistic: 27.41 on 9 and 774 DF,  p-value: < 2.2e-16

 

                                      

Top 10 observations with least weight

    Cooks.distance std.residual    weight SUBJ SBP  AGE      BMI EDU.NEW
378    0.044949304     4.889954 0.1901728  378 255 68.6 25.09852       1
515    0.027214392     4.458919 0.2131029  508 234 60.5 19.95936       2
287    0.028272829     4.035681 0.2301041  289 226 63.9 19.60519       2
53     0.010506557     3.644830 0.2595354   52 210 54.4 19.17458       1
751    0.025748526     3.597011 0.2619919  743 220 59.4 21.33458       1
679    0.009101414     3.545937 0.2672411  663 211 52.1 19.56086       1
118    0.014763596     3.533448 0.2632774  117 219 63.7 18.26150       1
574    0.014587082     3.331341 0.2864147  569 206 51.4 22.34778       2
345    0.016523152     3.314122 0.2743174  347 213 64.7 20.54569       1
156    0.016603603     3.299048 0.2845477  155 208 57.4 21.43600       1
    SMOKE.NEW PSMK ALH OCCU SEX
378         0    0   0    1   2
515         1    0   0    0   1
287         1    0   1    0   1
53          0    1   0    0   2
751         0    1   0    0   1
679         0    1   0    1   2
118         0    1   0    1   2
574         1    1   0    0   1
345         1    0   1    0   1
156         1    1   0    1   2