Tree-Based Machine Learning Methods for Prediction, Variable Selection and Causal Inference

Part II

Hemant Ishwaran and Min Lu
University of Miami

Inference

Family                              Terminal Node Statistics, Prediction
Regression                          mean, mean
Quantile Regression                 quantiles, moments, mean
Classification                      class proportions, class proportions, Bayes classifier
Imbalanced Two-Class                class proportions, class proportions, q-classifier
Survival                            Kaplan-Meier survival, Nelson-Aalen CHF, mortality
Competing Risk                      cause-specific CHF, cause-specific CIF, event-specific expected number of years lost
Multivariate Regression             per response: mean, mean
Multivariate Classification         per response: class proportions, class proportions, Bayes classifier
Multivariate Mixed Regression       same as above for Regression, Classification
Multivariate Quantile Regression    per response: quantiles, mean
Unsupervised                        none
sidClustering                       same as Multivariate Mixed Regression
Breiman (Shi-Horvath)               same as Classification

Inference

RF actually produces two ensembles!

The inbag ensemble is the average over the trained trees, where each tree is trained on its own bootstrap (or subsampled) copy of the data \[ F_{\!\text{ens}}(\mathbf{x}) =\frac{1}{B}\sum_{b=1}^B h^*_b(\mathbf{x}) \]

The out-of-bag (OOB) ensemble is case-specific. For case \(i\), the average is taken only over \(O_i\), the set of trees for which case \(i\) is OOB \[ F_{\!\text{ens}}^{(i)} =\frac{1}{\#O_i}\sum_{b\in O_i} h^*_b(\mathbf{x}_i),\hskip10pt \#O_i\approx .368 B \]

The inbag ensemble is used for prediction on test data. The OOB ensemble is used for inference.

OOB ensemble

  1. Equivalent to the leave-one-out (loo) bootstrap estimator [1]. Therefore it can be used to estimate generalization error \[ \text{Err} = \frac{1}{n}\sum_{i=1}^n L(Y_i, F_{\!\text{ens}}^{(i)}) \] For example, for regression \[ \text{Prediction Error} = \frac{1}{n}\sum_{i=1}^n (Y_i-F_{\!\text{ens}}^{(i)})^2 \]
  2. Avoids the need for cross-validation. Very convenient for estimating prediction error and for tuning the forest (see the sketch below).
  3. Provides a cross-validated estimator of the model itself, a feature not available with most other ML methods.
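
A minimal sketch of point 1 for regression (using the built-in mtcars data as a stand-in example): the OOB error computed directly from the OOB ensemble agrees with the error reported by the package.

## minimal sketch: OOB prediction error for regression computed directly
## from the OOB ensemble (mtcars used as a stand-in example)
library(randomForestSRC)
o <- rfsrc(mpg ~ ., data = mtcars)

## (1/n) * sum_i (Y_i - F_ens^(i))^2
mean((o$yvar - o$predicted.oob)^2)

## agrees with the OOB error reported by the package
tail(o$err.rate, 1)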

OOB classification example: Glioma

y cg11012046 cg21859434 cg23970089 cg25176823 cg24576735 age cg19099850 cg17403875 cg24078828
Mesenchymal-like 0.07659845 0.013786153 0.3651860 0.32445298 0.03255164 44 0.04020459 0.01333304 0.02021659
Mesenchymal-like 0.15108266 0.009155272 0.6024613 0.36269407 0.02215452 50 0.03771707 0.01045362 0.02098629
Classic-like 0.42735137 0.380247358 0.1984322 0.27367605 0.04060163 61 0.03648317 0.01304019 0.02570674
G-CIMP-low 0.93509040 0.861978174 0.5049500 0.47706717 0.03564740 20 0.88262115 0.62531752 0.42465204
LGm6-GBM 0.04602123 0.804717273 0.2564513 0.03827349 0.03625206 18 0.82269945 0.80449412 0.78443442

The key quantities returned by the package are

o$predicted     --->   inbag estimated probabilities
o$predicted.oob --->   OOB estimated probabilities

o$class         --->   inbag class predictions
o$class.oob     --->   OOB class predictions


o <- rfsrc(y~., data = glioma)
mean(o$class!=o$yvar)
> [1] 0
# inbag misclassification is zero!
mean(o$class.oob!=o$yvar)
> [1] 0.0671982
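
As a further illustration (a minimal sketch; the glioma forest o from above is assumed), the OOB class predictions can be cross-tabulated against the observed labels to see where the misclassifications occur:

## OOB confusion matrix for the glioma forest
table(observed = o$yvar, predicted = o$class.oob)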

OOB classification example: Glioma

Tune nodesize

rfsrc(formula, data, ntree = 500, mtry = NULL, ytry = NULL, nodesize = NULL, ...)

nodesize sets the minimum size of terminal nodes; the default values are:


  15 for survival
  15 for competing risk
  5 for regression
  1 for classification
  3 for mixed outcomes
  3 for unsupervised


We can use tune.nodesize() to find a good value:

o <- tune.nodesize(y~., data = glioma)
o$nsize.opt
> [1] 3
nodesize =  1    error = 11.95% 
nodesize =  2    error = 8.85% 
nodesize =  3    error = 8.41% 
nodesize =  4    error = 12.83% 
nodesize =  5    error = 11.5% 
nodesize =  6    error = 14.6% 
nodesize =  7    error = 12.83% 
nodesize =  8    error = 15.49% 
nodesize =  9    error = 14.6% 
nodesize =  10    error = 17.26% 
nodesize =  15    error = 15.04% 
nodesize =  20    error = 18.14% 
nodesize =  25    error = 18.14% 
nodesize =  30    error = 22.57% 
nodesize =  35    error = 27.43% 
nodesize =  40    error = 28.32% 
nodesize =  45    error = 32.3% 
nodesize =  50    error = 34.51% 
nodesize =  55    error = 50.88% 
nodesize =  60    error = 50.88% 
nodesize =  65    error = 50.88% 
nodesize =  70    error = 50.88% 
nodesize =  75    error = 50.88% 
nodesize =  80    error = 50.88% 
nodesize =  85    error = 50.88% 
nodesize =  90    error = 74.34% 
nodesize =  95    error = 74.34% 
nodesize =  100    error = 74.34% 
nodesize =  105    error = 74.34% 
nodesize =  110    error = 74.34% 
optimal nodesize: 3 
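
tune.nodesize() is essentially a convenience wrapper around the idea in the previous section: refit the forest over a grid of nodesize values and compare OOB error. A minimal sketch of the same idea done by hand (the glioma data is assumed to be loaded as above; the grid and ntree are arbitrary choices):

## manual nodesize tuning via OOB error (illustration only)
grid <- c(1, 3, 5, 10, 20)
err <- sapply(grid, function(ns) {
  fit <- rfsrc(y ~ ., data = glioma, nodesize = ns, ntree = 100)
  tail(fit$err.rate[, 1], 1)   ## overall OOB misclassification error
})
data.frame(nodesize = grid, oob.error = err)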

OOB survival example: PBC Mayo Clinic

o$survival      --->   inbag survival estimator for each case
o$survival.oob  --->   OOB survival estimator for each case
## load the PBC data
data(pbc, package="survival")

## remove the ID
pbc$id <- NULL

## convert to right-censoring with death as the event
pbc$status[pbc$status > 0] <- 1

## default RSF call
o <- rfsrc(Surv(time, status)~., pbc)

## choose some cases
idx <- c(11,34,60)

## plot the curves
matplot(o$time.interest, 
        t(o$survival[idx,]), type = "l", col=4, lwd=3,
        xlab = "Days", ylab = "Survival")
matlines(o$time.interest, 
        t(o$survival.oob[idx,]), type = "l", col=2, lwd=3)
legend("bottomleft", legend = c("inbag", "oob"), fill = c(4,2))

Prediction error


Family                              Prediction Error
Regression                          mean-squared error
Quantile Regression                 mean-squared error
Classification                      misclassification, Brier score
Imbalanced Two-Class                G-mean, misclassification, Brier score
Survival                            Harrell's C-index (1 minus concordance)
Competing Risk                      Harrell's C-index (1 minus concordance)
Multivariate Regression             per response: same as above for Regression
Multivariate Classification         per response: same as above for Classification
Multivariate Mixed Regression       per response: same as above for Regression, Classification
Multivariate Quantile Regression    same as Multivariate Regression
Unsupervised                        none
sidClustering                       same as Multivariate Mixed Regression
Breiman (Shi-Horvath)               same as Classification

Prediction error for classification

The error rate for the chosen performance metric is always available in obj$err.rate; however, any error rate can be calculated from obj$yvar together with obj$predicted.oob or obj$class.oob:

  1. Misclassification error: get.misclass.error
  2. Brier error: get.brier.error
  3. Log-loss (cross-entropy loss): get.logloss
  4. AUC: get.auc

Classification example: Glioma

Different prediction measures

                   (OOB) Brier score: 0.02333115
        (OOB) Normalized Brier score: 0.19053769
                           (OOB) AUC: 0.99304797
                      (OOB) Log-loss: 0.34115074
   (OOB) Requested performance error: 0.07061503, 0.06756757, 0.01149425, 0.00803213, 0.52, 0.53658537, 0.04651163, 0.11538462
tail(o$err.rate, 1)

>               all   Classic-like      Codel G-CIMP-high G-CIMP-low  LGm6-GBM   Mesenchymal-like   PA-like
> [500,] 0.07403189   0.07061503, 0.06756757, 0.01149425, 0.00803213, 0.52, 0.53658537, 0.04651163, 0.11538462

Classification example: Glioma

Different prediction measures

y cg23970089 cg25176823 cg24576735 Chr19.20co.gain TERT.promoter.status SNP6 U133a grade age gender
Mesenchymal-like 0.36518600 0.3244530 0.03255164 No chr 19/20 gain Mutant Yes Yes G4 44 female
Mesenchymal-like 0.60246131 0.3626941 0.02215452 No chr 19/20 gain Mutant Yes Yes G4 50 male
Mesenchymal-like 0.44011951 0.3679728 0.03369096 No chr 19/20 gain Mutant Yes No G4 56 female
Classic-like 0.06834791 0.3974054 0.03623296 No chr 19/20 gain Mutant Yes Yes G4 40 female
Classic-like 0.19843219 0.2736760 0.04060163 No chr 19/20 gain Mutant Yes Yes G4 61 female
o <- rfsrc(y~., data = glioma,
           perf.type = "misclass") ## default
o
                         Sample size: 878
           Frequency of class labels: 148, 174, 249, 25, 41, 215, 26
                     Number of trees: 500
           Forest terminal node size: 1
       Average no. of terminal nodes: 74.302
No. of variables tried at each split: 36
              Total no. of variables: 1241
       Resampling used to grow trees: swor
    Resample size used to grow trees: 555
                            Analysis: RF-C
                              Family: class
                      Splitting rule: gini *random*
       Number of random split points: 10
                   (OOB) Brier score: 0.02333115
        (OOB) Normalized Brier score: 0.19053769
                           (OOB) AUC: 0.99304797
                      (OOB) Log-loss: 0.34115074
   (OOB) Requested performance error: 0.07061503, 0.06756757, 0.01149425, 0.00803213, 0.52, 0.53658537, 0.04651163, 0.11538462
o <- rfsrc(y~., data = glioma,
           perf.type = "brier") 
o
                         Sample size: 878
           Frequency of class labels: 148, 174, 249, 25, 41, 215, 26
                     Number of trees: 500
           Forest terminal node size: 1
       Average no. of terminal nodes: 74.088
No. of variables tried at each split: 36
              Total no. of variables: 1241
       Resampling used to grow trees: swor
    Resample size used to grow trees: 555
                            Analysis: RF-C
                              Family: class
                      Splitting rule: gini *random*
       Number of random split points: 10
                   (OOB) Brier score: 0.02315138
        (OOB) Normalized Brier score: 0.18906962
                           (OOB) AUC: 0.99397374
                      (OOB) Log-loss: 0.33892075
   (OOB) Requested performance error: 0.18906962, 0.0402637, 0.01401861, 0.02455222, 0.01455572, 0.02340584, 0.06200621, 0.01026732

Classification example: Glioma

We can use the OOB ensemble to construct different kinds of prediction error

err <- data.frame(brier=get.brier.error(o$yvar,o$predicted.oob,FALSE),
                  brier.norm=get.brier.error(o$yvar,o$predicted.oob),
                  log.loss=get.logloss(o$yvar,o$predicted.oob),
                  auc=get.auc(o$yvar,o$predicted.oob),
                  misclass=mean(o$yvar!=o$class.oob))
rownames(err) <- "error"

err.cond <- data.frame(rbind(
  c(err$brier.norm, get.brier.error(o$yvar,o$predicted.oob,TRUE,TRUE)),
  c(err$misclass, get.misclass.error(o$yvar,o$class.oob))))
colnames(err.cond) <- paste0("err.", colnames(o$err.rate))
err.cond <- data.frame(metric=c("brier.norm", "misclass"), err.cond)
rownames(err.cond) <- NULL
print(t(err))
                error
brier      0.02315138
brier.norm 0.18906962
log.loss   0.33892075
auc        0.99397374
misclass   0.07517084

print(err.cond)
      metric    err.all err.Classic.like  err.Codel err.G.CIMP.high err.G.CIMP.low err.LGm6.GBM err.Mesenchymal.like err.PA.like
1 brier.norm 0.18906962       0.04026370 0.01401861      0.02455222     0.01455572   0.02340584           0.06200621  0.01026732
2   misclass 0.07517084       0.05405405 0.01724138      0.01204819     0.60000000   0.53658537           0.05116279  0.15384615

Prediction error for survival

1. Harrell’s C-error rate

  1. Extract using obj$err.rate or get.cindex

  2. Easy to understand

  3. Computationally expensive

  4. Biased in the presence of heavy censoring

2. Time-varying Brier score

  1. Not intuitive
  2. Computationally inexpensive: get.brier.survival
  3. Unbiased and robust

3. Time-varying AUC (AUC-t)

  1. Easy to understand
  2. In principle robust
  3. Requires sophisticated methods for calculating (see riskRegression package)
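
A minimal sketch of the first two metrics applied to the PBC forest o from the earlier survival example (o$yvar holds the time and status columns and o$predicted.oob is the OOB mortality; the exact components returned by get.brier.survival may vary by package version):

## Harrell's C-error (1 minus concordance), as reported in obj$err.rate
get.cindex(o$yvar[, 1], o$yvar[, 2], o$predicted.oob)

## time-varying OOB Brier score using a Kaplan-Meier censoring model
bs <- get.brier.survival(o, cens.model = "km")
plot(bs$brier.score, type = "s", xlab = "time", ylab = "OOB Brier score")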

Prediction error for survival

Time-varying Brier score and AUC-t for the PBC dataset

Tip:

Code generated using plot.brier.auc

Prediction

Four major ways prediction can be used:

  1. Prediction on test data: predict(object, testdata)

  2. Prediction on “artificial data” for partial plots: plot.variable, partial

  3. Prediction where a new data is substituted into the trained forest: predict(object, testdata, outcome = "test")

  4. Restoring the forest: predict(object, ...)

General call to predict

Prediction: canonical example

Veteran’s Administration Lung Cancer Trial

Randomized trial of two treatment regimens for lung cancer [2]

trt celltype time status karno diagtime age prior
1 1 72 1 60 7 69 0
1 1 411 1 70 5 64 10
1 1 228 1 60 3 38 0
1 1 126 1 60 9 63 10
1 1 118 1 70 11 65 10
1 1 10 1 20 5 49 0
1 1 82 1 40 10 69 10
1 1 110 1 80 29 68 0
1 1 314 1 50 18 43 0
1 1 100 0 70 6 70 0
data(veteran, package = "randomForestSRC")
dim(veteran)
> [1]  137   8

Prediction: canonical example

## veteran data (with factors)
data(veteran, package = "randomForestSRC")
veteran2 <- data.frame(lapply(veteran, factor))
veteran2$time <- veteran$time
veteran2$status <- veteran$status

## split the data into unbalanced train/test data (25/75)
## the train/test data have the same levels, but different labels
train <- sample(1:nrow(veteran2), round(nrow(veteran2) * .25))
> summary(veteran2[train,])
 trt    celltype      time            status           karno      diagtime       age     prior  
 1:18   1: 8     Min.   :  3.00   Min.   :0.0000   60     :7   3      : 7   60     : 3   0 :27  
 2:16   2:10     1st Qu.: 29.25   1st Qu.:1.0000   40     :6   2      : 6   66     : 3   10: 7  
        3: 9     Median : 53.50   Median :1.0000   80     :6   5      : 3   67     : 3          
        4: 7     Mean   :104.29   Mean   :0.9412   70     :4   8      : 3   69     : 3          
                 3rd Qu.:140.25   3rd Qu.:1.0000   90     :4   1      : 2   38     : 2          
                 Max.   :389.00   Max.   :1.0000   30     :3   4      : 2   48     : 2          
                                                   (Other):4   (Other):11   (Other):18   
> summary(veteran2[-train,])
 trt    celltype      time           status          karno       diagtime       age     prior  
 1:51   1:27     Min.   :  1.0   Min.   :0.000   60     :20   4      :17   63     :10   0 :70  
 2:52   2:38     1st Qu.: 21.5   1st Qu.:1.000   70     :19   2      :13   62     : 8   10:33  
        3:18     Median : 83.0   Median :1.000   80     :18   3      :11   64     : 7          
        4:20     Mean   :127.3   Mean   :0.932   30     :11   5      :11   65     : 7          
                 3rd Qu.:147.5   3rd Qu.:1.000   50     :11   11     : 5   68     : 5          
                 Max.   :999.0   Max.   :1.000   40     :10   12     : 5   70     : 5          
                                                 (Other):14   (Other):41   (Other):61  

Prediction: canonical example

## train the forest and use this to predict on test data
o <- rfsrc(Surv(time, status) ~ ., veteran2[train, ]) 
pred <- predict(o, veteran2[-train , ])
> print(o)
                         Sample size: 34
                    Number of deaths: 32
                     Number of trees: 500
           Forest terminal node size: 15
       Average no. of terminal nodes: 1
No. of variables tried at each split: 3
              Total no. of variables: 6
       Resampling used to grow trees: swor
    Resample size used to grow trees: 21
                            Analysis: RSF
                              Family: surv
                      Splitting rule: logrank *random*
       Number of random split points: 10
                          (OOB) CRPS: 57.73879973
                   (OOB) stand. CRPS: 0.14842879
   (OOB) Requested performance error: 0.88246269
> print(pred)
  Sample size of test (predict) data: 103
                Number of grow trees: 500
  Average no. of grow terminal nodes: 1
         Total no. of grow variables: 6
       Resampling used to grow trees: swor
    Resample size used to grow trees: 21
                            Analysis: RSF
                              Family: surv
                                CRPS: 54.3104615
                         stand. CRPS: 0.13961558
         Requested performance error: 0.49749046

Prediction: canonical example

## even harder ... factor level not previously encountered in training
veteran3 <- veteran2[1:3, ]
veteran3$celltype <- factor(c("newlevel", "1", "3"))
pred2 <- predict(o, veteran3)
print(pred2)
>   Sample size of test (predict) data: 3
>                 Number of grow trees: 500
>   Average no. of grow terminal nodes: 1
>          Total no. of grow variables: 6
>        Resampling used to grow trees: swor
>     Resample size used to grow trees: 21
>                             Analysis: RSF
>                               Family: surv
>                                 CRPS: 60.6330452
>                          stand. CRPS: 0.15586901
>          Requested performance error: 0.5

## the unusual level is treated like a missing value but is not removed
print(pred2$xvar)
>   trt celltype karno diagtime age prior
> 1   1     <NA>    60        7  69     0
> 2   1        1    70        5  64    10
> 3   1        3    60        3  38     0

Restore

predict(object, ...)

A useful feature of predict is that it can restore the original training forest.

After tuning the forest, you can use predict to obtain: \[ \begin{array}{ll} \texttt{importance} & \text{Variable Importance (VIMP)} \\ \texttt{forest.wt} & \text{Construct your own estimator} \\ \texttt{proximity} & \text{RF proximity} \\ \texttt{distance} & \text{RF distance} \\ \texttt{get.tree} & \text{Plot a tree on your browser} \\ \texttt{var.used, split.depth} & \text{Statistics summarizing variable and tree splitting} \\ \end{array} \]
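
A minimal sketch (mtcars regression as a stand-in example): obtaining VIMP and OOB proximity from an already-trained forest via restore, without regrowing the forest.

## restore the forest and request importance and proximity
o <- rfsrc(mpg ~ ., data = mtcars)
r <- predict(o, importance = "permute", proximity = "oob")
round(r$importance, 3)   ## permutation VIMP
dim(r$proximity)         ## n x n proximity matrix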

Restore

Tip

predict() can restore an object from rfsrc() with a new specification

y cg23970089 cg25176823 cg24576735 Chr19.20co.gain TERT.promoter.status SNP6 U133a grade age gender
Mesenchymal-like 0.36518600 0.3244530 0.03255164 No chr 19/20 gain Mutant Yes Yes G4 44 female
Mesenchymal-like 0.60246131 0.3626941 0.02215452 No chr 19/20 gain Mutant Yes Yes G4 50 male
Mesenchymal-like 0.44011951 0.3679728 0.03369096 No chr 19/20 gain Mutant Yes No G4 56 female
Classic-like 0.06834791 0.3974054 0.03623296 No chr 19/20 gain Mutant Yes Yes G4 40 female
Classic-like 0.19843219 0.2736760 0.04060163 No chr 19/20 gain Mutant Yes Yes G4 61 female
o <- rfsrc(y~., data = glioma) 
o
                         Sample size: 878
           Frequency of class labels: 148, 174, 249, 25, 41, 215, 26
                     Number of trees: 500
           Forest terminal node size: 1
       Average no. of terminal nodes: 74.302
No. of variables tried at each split: 36
              Total no. of variables: 1241
       Resampling used to grow trees: swor
    Resample size used to grow trees: 555
                            Analysis: RF-C
                              Family: class
                      Splitting rule: gini *random*
       Number of random split points: 10
                   (OOB) Brier score: 0.02333115
        (OOB) Normalized Brier score: 0.19053769
                           (OOB) AUC: 0.99304797
                      (OOB) Log-loss: 0.34115074
   (OOB) Requested performance error: 0.07061503, 0.06756757, 0.01149425, 0.00803213, 0.52, 0.53658537, 0.04651163, 0.11538462
p <- predict(o, perf.type = "brier") 
p
                         Sample size: 878
           Frequency of class labels: 148, 174, 249, 25, 41, 215, 26
                     Number of trees: 500
           Forest terminal node size: 1
       Average no. of terminal nodes: 74.302
No. of variables tried at each split: 36
              Total no. of variables: 1241
       Resampling used to grow trees: swor
    Resample size used to grow trees: 555
                            Analysis: RF-C
                              Family: class
                      Splitting rule: gini *random*
       Number of random split points: 10
                   (OOB) Brier score: 0.02333115
        (OOB) Normalized Brier score: 0.19053769
                           (OOB) AUC: 0.99304797
                      (OOB) Log-loss: 0.34115074
   (OOB) Requested performance error: 0.18688257, 0.03438887, 0.01210945, 0.02103588, 0.01195455, 0.01952019, 0.05238402, 0.0087921

Restore

Create your own estimator for regression

## create your own estimator for regression
o <- rfsrc(mpg~.,mtcars)
fwt <- predict(o, forest.wt="oob")$forest.wt
yhat <- c(fwt %*% o$yvar)

## compare to the OOB ensemble
print(summary(yhat - o$predicted.oob))
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-3.908e-14 -1.643e-14 -1.066e-14 -1.060e-14 -3.553e-15  1.066e-14
      
## compare prediction error to OOB ensemble
print(o)

                        Sample size: 32
                     Number of trees: 500
           Forest terminal node size: 5
       Average no. of terminal nodes: 3.476
No. of variables tried at each split: 4
              Total no. of variables: 10
       Resampling used to grow trees: swor
    Resample size used to grow trees: 20
                            Analysis: RF-R
                              Family: regr
                      Splitting rule: mse *random*
       Number of random split points: 10
                     (OOB) R squared: 0.76585425
   (OOB) Requested performance error: 8.50513415

print(mean((yhat - o$yvar)^2))
8.505134
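
The same OOB forest weights can be used to build estimators other than the mean. A minimal sketch of a weighted-median predictor built from fwt (illustration only; the package's quantreg function provides a full quantile regression forest):

## weighted-median predictions from the OOB forest weights
yhat.med <- apply(fwt, 1, function(w) {
  ord <- order(o$yvar)
  cw <- cumsum(w[ord])
  o$yvar[ord][which(cw >= 0.5)[1]]   ## weighted 50th percentile
})
summary(yhat.med)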

Partial plots

Partial plots are used to understand the effect of a feature on the outcome and are essentially prediction tools

Consider nonparametric regression \[ F(x_1,\ldots,x_p) = E[Y|(x_1,\ldots,x_p)] \] To study variable \(j\) we plot \(F_j(x)\) over \(x\) where \[ F_j(x) = \frac{1}{n}\sum_{i=1}^nF_{\!\text{ens}}(x_1=x_{i1},\ldots,\color{red}{x_j=x},\color{black}\ldots,x_p=x_{ip}) \] and \[ \Bigg\{\Big(x_1=x_{i1},\ldots,\color{red}{x_j=x}\color{black},\ldots,x_p=x_{ip}\Big):\hskip10pt i=1,\ldots,n\Bigg\} \] are “test” data values equal to the training data but where coordinate \(j\) is set to \(x\)
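
A minimal sketch of the definition above, computed by brute force for a single variable (mtcars regression used as a stand-in; plot.variable and partial, shown next, do this efficiently inside the forest):

## brute-force partial effect of wt: set x_j = x for every case and average
o <- rfsrc(mpg ~ ., data = mtcars)
x.grid <- sort(unique(mtcars$wt))
Fj <- sapply(x.grid, function(x) {
  dat <- mtcars
  dat$wt <- x                          ## coordinate j fixed at x
  mean(predict(o, dat)$predicted)      ## average the ensemble over the data
})
plot(x.grid, Fj, type = "b", xlab = "wt", ylab = "partial effect of wt")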

General call to plot.variable

Regression example: Iowa housing

We can use plot.variable to visualize \(F_j(x)\)

o <- rfsrc(SalePrice~., data = housing)
## partial effect for MS.Zoning
plot.variable(o, xvar.names = "MS.Zoning", partial = TRUE,
              ylab = "partial effect of MS.Zoning")

Regression example: Iowa housing

## partial effect for Overall.Qual
plot.variable(o, xvar.names = "Overall.Qual", partial = TRUE,
              ylab = "partial effect of Overall.Qual")

Regression example: Iowa housing

## partial effect for Overall.Qual
plot.variable(o, xvar.names = "Overall.Qual", partial = TRUE,
              smooth.lines = TRUE,
              ylab = "partial effect of Overall.Qual")

Regression example: Iowa housing

## subset of Lot.Frontage>100
plot.variable(o, xvar.names = "Overall.Qual", partial = TRUE,
              smooth.lines = TRUE,
              subset = o$xvar$Lot.Frontage>100,
              ylab = "partial effect of Overall.Qual", main = "Lot Frontage > 100")

Regression example: Iowa housing

We can use plot.variable to visualize multiple variables simultaneously

## multiple partial effects
plot.variable(o, xvar.names = c("MS.Zoning", "Overall.Qual"), 
              plots.per.page = 2,
              partial = TRUE)

Regression example: Iowa housing

We can make changes to a pre-processed plot.variable object

p.v <- plot.variable(o, 
               xvar.names = c("MS.Zoning",
                               "Overall.Qual"), 
                     partial = TRUE, 
                     show.plots = FALSE)
plot.variable(p.v)

plotthis <- p.v$plotthis
plot(plotthis[["MS.Zoning"]], 
     ylab = "partial effect", 
     type = "b")
plot(plotthis[["Overall.Qual"]], 
     ylab = "partial effect")

General call to partial

Regression example: Iowa housing

We can use partial and get.partial.plot.data to obtain values of \(F_j(x)\)

partial.obj <- partial(o,
                       partial.xvar = "Overall.Qual",
                       partial.values = o$xvar$Overall.Qual)
pdta <- get.partial.plot.data(partial.obj, granule = TRUE)
boxplot(pdta$yhat ~ pdta$x, xlab = "Overall.Qual", ylab = "partial effect")

Regression example: Iowa housing

We can use partial and get.partial.plot.data to obtain values of \(F_j(x)\)

## specify two variables of interest
Overall.Qual <- sort(unique(o$xvar$Overall.Qual)); Gr.Liv.Area <- sort(unique(o$xvar$Gr.Liv.Area))
pdta <- do.call(rbind, lapply(Gr.Liv.Area, function(x2) {
  o <- partial(o,
               partial.xvar = "Overall.Qual", partial.xvar2 = "Gr.Liv.Area",
               partial.values = Overall.Qual, partial.values2 = x2)
  cbind(Overall.Qual, x2, get.partial.plot.data(o)$yhat)
}))
pdta <- data.frame(pdta); colnames(pdta) <- c("Overall.Qual", "Gr.Liv.Area", "effectSize")
coplot(effectSize ~ Overall.Qual|Gr.Liv.Area, pdta, pch = 16, overlap = 0)

Classification example: Glioma

We can use plot.variable to visualize \(F_j(x)\)

o <- rfsrc(y~., data = glioma)
## partial effect for TERT.promoter.status; defaults to focus on the first class
plot.variable(o, xvar.names = "TERT.promoter.status", 
              partial = TRUE)

Classification example: Glioma

# specifying the class label to focus on using "target"
plot.variable(o, 
              xvar.names = "TERT.promoter.status", 
              target = c("Codel"),
              partial = TRUE)

Classification example: Glioma

We can use partial and get.partial.plot.data to obtain values of \(F_j(x)\)

partial.obj <- partial(o, partial.xvar = "TERT.promoter.status", partial.values = o$xvar$TERT.promoter.status)
## extract partial effects for each class of the outcome
pdta <- list(); ylim <- c(); lvls <- levels(o$yvar)
for (i in 1:length(lvls)){
  pdta[[i]] <- get.partial.plot.data(partial.obj, target = lvls[i])
  ylim <- c(ylim, range(pdta[[i]]$yhat))
}
plot(pdta[[1]]$x, rep(NA, length(pdta[[1]]$x)), 
     xlab = "TERT.promoter.status", ylab = "adjusted probability", 
     ylim = range(ylim))
for (i in 1:length(lvls)){
  points(pdta[[i]]$x, pdta[[i]]$yhat, col = i, type = "b", pch = 16)
}
legend("topleft", legend = levels(o$yvar), fill = 1:length(lvls))

Survival example: peakVO2

The data involve 2231 patients with systolic heart failure who underwent cardiopulmonary stress testing at the Cleveland Clinic. The primary end point was all-cause death. In total, 39 variables were measured for each patient, including baseline clinical values and exercise stress test results. A key variable of interest is peak VO2 (mL/kg per min), the peak oxygen uptake achieved during exercise testing [3].

ttodead died bun sodium hgb glucose male black crcl age betablok dilver nifed acei angioten.II anti.arrhy anti.coag aspirin digoxin
1.207392 0 25.00000 141.0000 14.60000 89.00000 1 0 49.92778 74 1 0 0 0 1 1 1 1 0
3.611225 0 20.00000 138.0000 13.10000 90.00000 0 0 60.35000 77 1 0 0 1 0 0 0 1 1
0.238193 1 29.68983 139.8368 12.87517 98.74435 1 0 33.61172 79 0 0 0 0 0 0 0 1 1
10.513347 0 27.87734 139.4590 13.16225 136.87414 1 0 69.25114 67 0 0 0 1 0 0 1 0 1
3.712526 1 39.00000 143.0000 12.50000 104.00000 1 0 44.67685 79 1 0 0 0 1 0 0 1 0
data(peakVO2, package = "randomForestSRC") 
dim(peakVO2)
> [1]  2231   41

Survival example: peakVO2

We can use plot.variable to visualize \(F_j(x)\)

o <- rfsrc(Surv(ttodead,died)~., peakVO2)
## partial effect for age
plot.variable(o, surv.type = "surv", xvar.names = "age", 
              time = 2, partial = TRUE)

Survival example: peakVO2


## partial effect for age
plot.variable(o, surv.type = "surv", xvar.names = "age", 
              smooth.lines = TRUE,
              time = 2, partial = TRUE)

Survival example: peakVO2


## partial effect for age
plot.variable(o, surv.type = "rel.freq", xvar.names = "age", 
              smooth.lines = TRUE,
              partial = TRUE)

Survival example: peakVO2

We can use partial and get.partial.plot.data to obtain values of \(F_j(x)\)

## partial effect of peak VO2 on mortality

partial.o <- partial(o,
                     partial.type = "mort",
                     partial.xvar = "peak.vo2",
                     partial.values = o$xvar$peak.vo2,
                     partial.time = o$time.interest)
pdta.m <- get.partial.plot.data(partial.o)
## partial effect of peak VO2 on survival
pvo2 <- quantile(o$xvar$peak.vo2)
partial.o <- partial(o,
                     partial.type = "surv",
                     partial.xvar = "peak.vo2",
                     partial.values = pvo2,
                     partial.time = o$time.interest)
pdta.s <- get.partial.plot.data(partial.o)


## compare the two plots
par(mfrow=c(1,2))    

plot(lowess(pdta.m$x, pdta.m$yhat, f = 2/3),
     type = "l", xlab = "peak VO2", ylab = "adjusted mortality")
rug(o$xvar$peak.vo2)

matplot(pdta.s$partial.time, t(pdta.s$yhat), type = "l", lty = 1,
        xlab = "years", ylab = "peak VO2 adjusted survival")
legend("bottomleft", legend = paste0("peak VO2 = ", pvo2),
       bty = "n", cex = .75, fill = 1:5)

Outline

Part I: Training

  1. Quick start
  2. Data structures allowed
  3. Training (grow) with examples
    (regression, classification, survival)

Part II: Inference and Prediction

  1. Inference (OOB)
  2. Prediction Error
  3. Prediction
  4. Restore
  5. Partial Plots

Part III: Variable Selection

  1. VIMP
  2. Subsampling (Confidence Intervals)
  3. Minimal Depth
  4. VarPro

Part IV: Advanced Examples

  1. Class Imbalanced Data
  2. Competing Risks
  3. Multivariate Forests
  4. Missing data imputation

References

1. Efron B, Tibshirani R. Improvements on cross-validation: The 632+ bootstrap method. Journal of the American Statistical Association. 1997;92:548–60.
2. Kalbfleisch JD, Prentice RL. The statistical analysis of failure time data. John Wiley & Sons; 1980.
3. Hsich E, Gorodeski EZ, Blackstone EH, Ishwaran H, Lauer MS. Identifying important risk factors for survival in patients with systolic heart failure using random survival forests. Circulation: Cardiovascular Quality and Outcomes. 2011;4:39–45.