[Table: ensemble predicted values by family: per-response means (Regression); per-response class proportions and the Bayes classifier (Classification); per-response quantiles and means; same as Regression/Classification per response (Multivariate, Mixed); none (Unsupervised: sidClustering, Breiman/Shi-Horvath).]
Inference
RF actually produces two ensembles!
The inbag ensemble is the average over the trained trees, where each tree is trained on a bootstrap (or subsampled) copy of the data \[
F_{\!\text{ens}}(\mathbf{x})
=\frac{1}{B}\sum_{b=1}^B h^*_b(\mathbf{x})
\]
The out-of-bag (OOB) ensemble is case-specific. For case \(i\), the average is taken over trees where \(i\) is OOB \[
F_{\!\text{ens}}^{(i)}
=\frac{1}{\#O_i}\sum_{b\in O_i} h^*_b(\mathbf{x}_i),\hskip10pt
\#O_i\approx .368 B
\]
The inbag ensemble is used for prediction on test data. The OOB ensemble is used for inference
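To make the two ensembles concrete, here is a minimal hand-rolled bagging sketch (illustrative only: rpart regression trees on mtcars stand in for the forest; this is not the randomForestSRC implementation):

## hand-rolled bagging: build B trees on bootstrap samples, then form the
## inbag and OOB ensembles from the per-tree predictions
library(rpart)
B <- 200
n <- nrow(mtcars)
preds <- matrix(NA, n, B)      ## preds[i, b] = prediction of tree b for case i
inbag <- matrix(FALSE, n, B)   ## inbag[i, b] = TRUE if case i was drawn for tree b
for (b in 1:B) {
  idx <- sample(n, n, replace = TRUE)   ## bootstrap sample
  inbag[unique(idx), b] <- TRUE
  fit <- rpart(mpg ~ ., data = mtcars[idx, ])
  preds[, b] <- predict(fit, mtcars)
}
## inbag ensemble: average over all B trees
F.ens <- rowMeans(preds)
## OOB ensemble: for case i, average only over the trees where i is OOB
F.oob <- sapply(1:n, function(i) mean(preds[i, !inbag[i, ]]))
## each case is OOB for roughly exp(-1) = .368 of the trees
mean(!inbag)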
OOB ensemble
Equivalent to a leave-one-out (loo) bootstrap estimator [1]. Therefore it can be used for estimating generalization error \[
\text{Err}
= \frac{1}{n}\sum_{i=1}^n L(Y_i, F_{\!\text{ens}}^{(i)})
\] For example, for regression \[
\text{Prediction Error}
= \frac{1}{n}\sum_{i=1}^n (Y_i-F_{\!\text{ens}}^{(i)})^2
\]
This avoids the need for cross-validation and is very convenient for estimating prediction error and for tuning the forest.
Provides a cross-validated estimator of the model, a unique feature not seen with other ML methods.
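For instance, with randomForestSRC the OOB mean-squared error can be read off directly (a short sketch; mtcars is stand-in data):

## OOB prediction error computed directly from the OOB ensemble
library(randomForestSRC)
o <- rfsrc(mpg ~ ., data = mtcars)
err.oob <- mean((o$yvar - o$predicted.oob)^2)
## agrees with the error reported by the package (last entry of err.rate)
print(c(err.oob, o$err.rate[o$ntree]))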
OOB classification example: Glioma
                y cg11012046  cg21859434 cg23970089 cg25176823 cg24576735 age cg19099850 cg17403875 cg24078828
 Mesenchymal-like 0.07659845 0.013786153  0.3651860 0.32445298 0.03255164  44 0.04020459 0.01333304 0.02021659
 Mesenchymal-like 0.15108266 0.009155272  0.6024613 0.36269407 0.02215452  50 0.03771707 0.01045362 0.02098629
     Classic-like 0.42735137 0.380247358  0.1984322 0.27367605 0.04060163  61 0.03648317 0.01304019 0.02570674
       G-CIMP-low 0.93509040 0.861978174  0.5049500 0.47706717 0.03564740  20 0.88262115 0.62531752 0.42465204
         LGm6-GBM 0.04602123 0.804717273  0.2564513 0.03827349 0.03625206  18 0.82269945 0.80449412 0.78443442
The key quantities returned by the package are
o$predicted      ---> inbag estimated probabilities
o$predicted.oob  ---> OOB estimated probabilities
o$class          ---> inbag class predictions
o$class.oob      ---> OOB class predictions
o <- rfsrc(y ~ ., data = glioma)
mean(o$class != o$yvar)
> [1] 0
# inbag misclassification is zero!
mean(o$class.oob != o$yvar)
> [1] 0.0671982
o$survival      ---> inbag survival estimator for each case
o$survival.oob  ---> OOB survival estimator for each case
## load the PBC data
data(pbc, package = "survival")
## remove the ID
pbc$id <- NULL
## convert to right-censoring with death as the event
pbc$status[pbc$status > 0] <- 1
## default RSF call
o <- rfsrc(Surv(time, status) ~ ., pbc)
## choose some cases
idx <- c(11, 34, 60)
## plot the curves
matplot(o$time.interest, t(o$survival[idx, ]), type = "l", col = 4, lwd = 3,
        xlab = "Days", ylab = "Survival")
matlines(o$time.interest, t(o$survival.oob[idx, ]), type = "l", col = 2, lwd = 3)
legend("bottomleft", legend = c("inbag", "oob"), fill = c(4, 2))
[Table: OOB ensemble values by family: same as the corresponding ensembles above for Regression, Classification, and the Multivariate/Mixed families; none (Unsupervised: sidClustering, Breiman/Shi-Horvath).]
Prediction error for classification
The error rate for the chosen performance metric is always available in obj$err.rate; however, any error rate can be computed by comparing obj$yvar with obj$predicted.oob or obj$class.oob.
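As a quick check, both quantities can be recomputed by hand from the OOB outputs (a sketch; o is the glioma classification forest fit earlier, and the multiclass Brier convention assumed here, averaging over all n x K entries of the probability matrix, is consistent with the package's reported values):

## recompute misclassification and (normalized) Brier score from OOB outputs
miss <- mean(o$class.oob != o$yvar)         ## OOB misclassification rate
K <- length(levels(o$yvar))
y.ind <- model.matrix(~ o$yvar - 1)         ## n x K one-hot class indicators
brier <- mean((o$predicted.oob - y.ind)^2)  ## Brier score (assumed convention)
brier.norm <- brier * K^2 / (K - 1)         ## normalized Brier score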
o <- rfsrc(y ~ ., data = glioma, perf.type = "misclass")  ## default
o
  Sample size: 878
  Frequency of class labels: 148, 174, 249, 25, 41, 215, 26
  Number of trees: 500
  Forest terminal node size: 1
  Average no. of terminal nodes: 74.302
  No. of variables tried at each split: 36
  Total no. of variables: 1241
  Resampling used to grow trees: swor
  Resample size used to grow trees: 555
  Analysis: RF-C
  Family: class
  Splitting rule: gini *random*
  Number of random split points: 10
  (OOB) Brier score: 0.02333115
  (OOB) Normalized Brier score: 0.19053769
  (OOB) AUC: 0.99304797
  (OOB) Log-loss: 0.34115074
  (OOB) Requested performance error: 0.07061503, 0.06756757, 0.01149425, 0.00803213, 0.52, 0.53658537, 0.04651163, 0.11538462
o <- rfsrc(y ~ ., data = glioma, perf.type = "brier")
o
  Sample size: 878
  Frequency of class labels: 148, 174, 249, 25, 41, 215, 26
  Number of trees: 500
  Forest terminal node size: 1
  Average no. of terminal nodes: 74.088
  No. of variables tried at each split: 36
  Total no. of variables: 1241
  Resampling used to grow trees: swor
  Resample size used to grow trees: 555
  Analysis: RF-C
  Family: class
  Splitting rule: gini *random*
  Number of random split points: 10
  (OOB) Brier score: 0.02315138
  (OOB) Normalized Brier score: 0.18906962
  (OOB) AUC: 0.99397374
  (OOB) Log-loss: 0.33892075
  (OOB) Requested performance error: 0.18906962, 0.0402637, 0.01401861, 0.02455222, 0.01455572, 0.02340584, 0.06200621, 0.01026732
Classification example: Glioma
We can use the OOB ensemble to construct different kinds of prediction error
## veteran data (with factors)
data(veteran, package = "randomForestSRC")
veteran2 <- data.frame(lapply(veteran, factor))
veteran2$time <- veteran$time
veteran2$status <- veteran$status
## split the data into unbalanced train/test data (25/75)
## the train/test data have the same levels, but different labels
train <- sample(1:nrow(veteran2), round(nrow(veteran2) * .25))
> summary(veteran2[train, ])
## [train summary: factor counts for trt, celltype, karno, diagtime, age, prior;
##  numeric summaries for time and status]
> summary(veteran2[-train, ])
## [test summary: same columns; the factor levels match the training data but the
##  label frequencies differ]
Prediction: canonical example
## train the forest and use this to predict on test data
o <- rfsrc(Surv(time, status) ~ ., veteran2[train, ])
pred <- predict(o, veteran2[-train, ])
> print(o)
  Sample size: 34
  Number of deaths: 32
  Number of trees: 500
  Forest terminal node size: 15
  Average no. of terminal nodes: 1
  No. of variables tried at each split: 3
  Total no. of variables: 6
  Resampling used to grow trees: swor
  Resample size used to grow trees: 21
  Analysis: RSF
  Family: surv
  Splitting rule: logrank *random*
  Number of random split points: 10
  (OOB) CRPS: 57.73879973
  (OOB) stand. CRPS: 0.14842879
  (OOB) Requested performance error: 0.88246269
> print(pred)
  Sample size of test (predict) data: 103
  Number of grow trees: 500
  Average no. of grow terminal nodes: 1
  Total no. of grow variables: 6
  Resampling used to grow trees: swor
  Resample size used to grow trees: 21
  Analysis: RSF
  Family: surv
  CRPS: 54.3104615
  stand. CRPS: 0.13961558
  Requested performance error: 0.49749046
Prediction: canonical example
## even harder ... factor level not previously encountered in training
veteran3 <- veteran2[1:3, ]
veteran3$celltype <- factor(c("newlevel", "1", "3"))
pred2 <- predict(o, veteran3)
print(pred2)
>   Sample size of test (predict) data: 3
>   Number of grow trees: 500
>   Average no. of grow terminal nodes: 1
>   Total no. of grow variables: 6
>   Resampling used to grow trees: swor
>   Resample size used to grow trees: 21
>   Analysis: RSF
>   Family: surv
>   CRPS: 60.6330452
>   stand. CRPS: 0.15586901
>   Requested performance error: 0.5
## the unusual level is treated like a missing value but is not removed
print(pred2$xvar)
>   trt celltype karno diagtime age prior
> 1   1     <NA>    60        7  69     0
> 2   1        1    70        5  64    10
> 3   1        3    60        3  38     0
Restore
predict(object, ...)
A useful feature of predict is that it can restore the original training forest
After tuning the forest, you can use predict to obtain: \[
\begin{array}{ll}
\texttt{importance} & \text{Variable Importance (VIMP)} \\
\texttt{forest.wt} & \text{Construct your own estimator} \\
\texttt{proximity} & \text{RF proximity} \\
\texttt{distance} & \text{RF distance} \\
\texttt{get.tree} & \text{Plot a tree in your browser} \\
\texttt{var.used, split.depth} & \text{Statistics summarizing variable and tree splitting} \\
\end{array}
\]
Restore
Tip
predict() can restore an object from rfsrc() with a new specification
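For instance (a minimal sketch; mtcars is stand-in data, and each call restores the trained forest rather than regrowing it):

## restore the forest with a new specification to request extra output
o <- rfsrc(mpg ~ ., data = mtcars)
v <- predict(o, importance = TRUE)    ## adds VIMP
w <- predict(o, forest.wt = TRUE)     ## adds forest weights
p <- predict(o, proximity = TRUE)     ## adds RF proximity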
                y cg23970089 cg25176823 cg24576735   Chr19.20co.gain TERT.promoter.status SNP6 U133a grade age gender
 Mesenchymal-like 0.36518600  0.3244530 0.03255164 No chr 19/20 gain               Mutant  Yes   Yes    G4  44 female
 Mesenchymal-like 0.60246131  0.3626941 0.02215452 No chr 19/20 gain               Mutant  Yes   Yes    G4  50   male
 Mesenchymal-like 0.44011951  0.3679728 0.03369096 No chr 19/20 gain               Mutant  Yes    No    G4  56 female
     Classic-like 0.06834791  0.3974054 0.03623296 No chr 19/20 gain               Mutant  Yes   Yes    G4  40 female
     Classic-like 0.19843219  0.2736760 0.04060163 No chr 19/20 gain               Mutant  Yes   Yes    G4  61 female
o <- rfsrc(y ~ ., data = glioma)
o
  Sample size: 878
  Frequency of class labels: 148, 174, 249, 25, 41, 215, 26
  Number of trees: 500
  Forest terminal node size: 1
  Average no. of terminal nodes: 74.302
  No. of variables tried at each split: 36
  Total no. of variables: 1241
  Resampling used to grow trees: swor
  Resample size used to grow trees: 555
  Analysis: RF-C
  Family: class
  Splitting rule: gini *random*
  Number of random split points: 10
  (OOB) Brier score: 0.02333115
  (OOB) Normalized Brier score: 0.19053769
  (OOB) AUC: 0.99304797
  (OOB) Log-loss: 0.34115074
  (OOB) Requested performance error: 0.07061503, 0.06756757, 0.01149425, 0.00803213, 0.52, 0.53658537, 0.04651163, 0.11538462
p <- predict(o, perf.type = "brier")
p
  Sample size: 878
  Frequency of class labels: 148, 174, 249, 25, 41, 215, 26
  Number of trees: 500
  Forest terminal node size: 1
  Average no. of terminal nodes: 74.302
  No. of variables tried at each split: 36
  Total no. of variables: 1241
  Resampling used to grow trees: swor
  Resample size used to grow trees: 555
  Analysis: RF-C
  Family: class
  Splitting rule: gini *random*
  Number of random split points: 10
  (OOB) Brier score: 0.02333115
  (OOB) Normalized Brier score: 0.19053769
  (OOB) AUC: 0.99304797
  (OOB) Log-loss: 0.34115074
  (OOB) Requested performance error: 0.18688257, 0.03438887, 0.01210945, 0.02103588, 0.01195455, 0.01952019, 0.05238402, 0.0087921
Restore
Create your own estimator for regression
## create your own estimator for regression
o <- rfsrc(mpg ~ ., mtcars)
fwt <- predict(o, forest.wt = "oob")$forest.wt
yhat <- c(fwt %*% o$yvar)
## compare to the OOB ensemble
print(summary(yhat - o$predicted.oob))
       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
 -3.908e-14 -1.643e-14 -1.066e-14 -1.060e-14 -3.553e-15  1.066e-14
## compare prediction error to OOB ensemble
print(o)
  Sample size: 32
  Number of trees: 500
  Forest terminal node size: 5
  Average no. of terminal nodes: 3.476
  No. of variables tried at each split: 4
  Total no. of variables: 10
  Resampling used to grow trees: swor
  Resample size used to grow trees: 20
  Analysis: RF-R
  Family: regr
  Splitting rule: mse *random*
  Number of random split points: 10
  (OOB) R squared: 0.76585425
  (OOB) Requested performance error: 8.50513415
print(mean((yhat - o$yvar)^2))
8.505134
Partial plots
Partial plots are used to understand the effect of a feature on the outcome and are essentially prediction tools
Consider nonparametric regression \[
F(x_1,\ldots,x_p) = E[Y|(x_1,\ldots,x_p)]
\] To study variable \(j\) we plot \(F_j(x)\) over \(x\) where \[
F_j(x)
= \frac{1}{n}\sum_{i=1}^nF_{\!\text{ens}}(x_1=x_{i1},\ldots,\color{red}{x_j=x},\color{black}\ldots,x_p=x_{ip})
\] and \[
\Bigg\{\Big(x_1=x_{i1},\ldots,\color{red}{x_j=x}\color{black},\ldots,x_p=x_{ip}\Big):\hskip10pt
i=1,\ldots,n\Bigg\}
\] are “test” data values equal to the training data but where coordinate \(j\) is set to \(x\)
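A brute-force version of this calculation can be written directly with predict (a sketch for a regression forest; partial.effect is a hypothetical helper, and the package's partial and plot.variable functions below are the efficient route):

## brute-force partial effect: set variable xvar to each value in x.values for
## every training case, then average the ensemble predictions
partial.effect <- function(o, dta, xvar, x.values) {
  sapply(x.values, function(x) {
    new.dta <- dta
    new.dta[[xvar]] <- x                    ## x_j = x for all cases
    mean(predict(o, new.dta)$predicted)     ## (1/n) * sum_i F.ens(...)
  })
}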
General call to plot.variable
Regression example: Iowa housing
We can use plot.variable to visualize \(F_j(x)\)
o <- rfsrc(SalePrice ~ ., data = housing)
## partial effect for MS.Zoning
plot.variable(o, xvar.names = "MS.Zoning", partial = TRUE,
              ylab = "partial effect of MS.Zoning")
Regression example: Iowa housing
## partial effect for Overall.Qual
plot.variable(o, xvar.names = "Overall.Qual", partial = TRUE,
              ylab = "partial effect of Overall.Qual")
Regression example: Iowa housing
## partial effect for Overall.Qual
plot.variable(o, xvar.names = "Overall.Qual", partial = TRUE,
              smooth.lines = TRUE, ylab = "partial effect of Overall.Qual")
Regression example: Iowa housing
## subset of Lot.Frontage > 100
plot.variable(o, xvar.names = "Overall.Qual", partial = TRUE,
              smooth.lines = TRUE, subset = o$xvar$Lot.Frontage > 100,
              ylab = "partial effect of Overall.Qual", main = "Lot Frontage > 100")
Regression example: Iowa housing
We can use plot.variable to visualize multiple variables simultaneously
o <- rfsrc(y ~ ., data = glioma)
## partial effect for TERT.promoter.status; defaults to focus on the first class
plot.variable(o, xvar.names = "TERT.promoter.status", partial = TRUE)
Classification example: Glioma
# specifying the class label to focus on using "target"
plot.variable(o, xvar.names = "TERT.promoter.status", target = c("Codel"),
              partial = TRUE)
Classification example: Glioma
We can use partial and get.partial.plot.data to obtain values of \(F_j(x)\)
        PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street         Alley Lot.Shape Land.Contour Lot.Frontage.1 Garage.Yr.Blt
  526301100          20        RL          141    31770   Pave noAlleyAccess       IR1          Lvl            141          1960
  526350040          20        RH           80    11622   Pave noAlleyAccess       Reg          Lvl             80          1961
  526351010          20        RL           81    14267   Pave noAlleyAccess       IR1          Lvl             81          1958
partial.obj <- partial(o, partial.xvar = "TERT.promoter.status",
                       partial.values = o$xvar$TERT.promoter.status)
## extract partial effects for each class outcome
pdta <- list(); ylim <- c(); lvls <- levels(o$yvar)
for (i in 1:length(lvls)) {
  pdta[[i]] <- get.partial.plot.data(partial.obj, target = lvls[i])
  ylim <- c(ylim, range(pdta[[i]]$yhat))
}
plot(pdta[[1]]$x, rep(NA, length(pdta[[1]]$x)),
     xlab = "TERT.promoter.status", ylab = "adjusted probability",
     ylim = range(ylim))
for (i in 1:length(lvls)) {
  points(pdta[[i]]$x, pdta[[i]]$yhat, col = i, type = "b", pch = 16)
}
legend("topleft", legend = levels(o$yvar), fill = 1:length(lvls))
Survival example: peakVO2
The data involve 2231 patients with systolic heart failure who underwent cardiopulmonary stress testing at the Cleveland Clinic. The primary end point was all-cause death. In total, 39 variables were measured for each patient, including baseline clinical values and exercise stress test results. Key variables of interest include peak VO2 (mL/kg per min) and the peak respiratory exchange ratio [3]
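A minimal sketch of how this analysis can be set up (the peakVO2 data ship with randomForestSRC; the outcome names ttodead and died are assumed from the package examples and should be checked):

## grow a survival forest for all-cause death on the peakVO2 data
data(peakVO2, package = "randomForestSRC")
o <- rfsrc(Surv(ttodead, died) ~ ., data = peakVO2, ntree = 1000)
print(o)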
1. Efron B, Tibshirani R. Improvements on cross-validation: The 632+ bootstrap method. Journal of the American Statistical Association. 1997;92:548–60.
2. Kalbfleisch JD, Prentice RL. The statistical analysis of failure time data. John Wiley & Sons; 1980.
3. Hsich E, Gorodeski EZ, Blackstone EH, Ishwaran H, Lauer MS. Identifying important risk factors for survival in patient with systolic heart failure using random survival forests. Circulation: Cardiovascular Quality and Outcomes. 2011;4:39–45.