flowchart LR A[Grow a forest] --> B[Check the output] --> C[Make the prediction]
Part I
This cover covers the following R packages
randomForestSRC
: Fast Unified Random Forests for Survival, Regression, and Classification (CRAN link)
randomForestSRC.run
: Integrated visualization for the randomForestSRC
package (Github link)
varPro
: Model-Independent Variable Selection via the Rule-Based Variable Priority (Github link)
randomForestSRC is a fast OpenMP and memory efficient package for fitting random forests (RF) for univariate, multivariate, unsupervised, survival, competing risks, class imbalanced classification and quantile regression
A random forest (RF) [1] consists of a collection of random tree structured predictors {\(h({\bf x},{{\bf\Theta}_b}), b= 1,2,\ldots, B\)} where {\({{\bf\Theta}_b}\)} are independent identically distributed random vectors. The RF predictor is the ensemble \[F_{{\!\text{ens}}}({\bf x})=\frac{1}{B}\sum_{b=1}^B h({\bf x},{{\bf\Theta}_b})\]
Typically, \({{\bf\Theta}_b}\) encode the instructions for bootstrapping the data and for random feature selection used for tree splitting
rfsrc(formula, data, ntree, mtry, nodesize)
Grow (train) your RF through rfsrc
Specify your model in formula
Provide your data frame in data
Tune your model via ntree, mtry, nodesize
rfsrc(formula, data, ntree, mtry, nodesize)
Family | Example Grow Call with Formula Specification |
---|---|
Survival | rfsrc(Surv(time, status)~., data = veteran) |
Competing Risk | rfsrc(Surv(time, status)~., data = wihs) |
Regression Quantile Regression |
rfsrc(Ozone~., data = airquality) quantreg(mpg~., data = mtcars)
|
Classification Imbalanced Two-Class |
rfsrc(Species~., data = iris) imbalanced(status~., data = breast)
|
Multivariate Regression Multivariate Mixed Regression Multivariate Quantile Regression Multivariate Mixed Quantile Regression |
rfsrc(Multivar(mpg, cyl)~., data = mtcars) rfsrc(cbind(Species,Sepal.Length)~.,data=iris) quantreg(cbind(mpg, cyl)~., data = mtcars) quantreg(cbind(Species,Sepal.Length)~.,data=iris)
|
Unsupervised sidClustering Breiman (Shi-Horvath) |
rfsrc(data = mtcars) sidClustering(data = mtcars) sidClustering(data = mtcars, method = "sh")
|
Learn more [2]: https://www.randomforestsrc.org/articles/getstarted.html
flowchart LR A[Grow a forest] --> B[Check the output] --> C[Make the prediction]
flowchart LR A[<h3 style="color:darkgreen">Grow a forest</h3>] --> B[Check the output] --> C[Make the prediction]
flowchart LR A[Grow a forest] --> B[<h3 style="color:darkgreen">Check the output</h3>] --> C[Make the prediction]
library(randomForestSRC)
# New York air quality measurements. Mean ozone in parts per billion.
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
print(airq.obj)
> 1 Sample size: 111
> 2 Number of trees: 500
> 3 Forest terminal node size: 5
> 4 Average no. of terminal nodes: 14.68
> 5 No. of variables tried at each split: 2
> 6 Total no. of variables: 5
> 7 Resampling used to grow trees: swor
> 8 Resample size used to grow trees: 70
> 9 Analysis: RF-R
> 10 Family: regr
> 11 Splitting rule: mse *random*
> 12 Number of random split points: 10
> 13 (OOB) R squared: 0.70871819
> 14 (OOB) Requested performance error: 322.53346594
Tip
The last line(s) will show the cross validated prediction performance
$predicted.oob
), and the latter are from the bootstrap data used to train the tree ($predicted
)$predicted.oob
should be used for inference on the training data since they are cross-validated and will not over-fit the data. we can get the error rate (mean-squared error) directly from the OOB ensemble by comparing the response to the OOB predictor below
Learn more [3]: https://www.randomforestsrc.org/articles/forestWgt.html
Family | Prediction Error |
---|---|
Regression Quantile Regression |
mean-squared error mean-squared error |
Classification Imbalanced Two-Class |
misclassification, Brier score G-mean, misclassification, Brier score |
Survival | Harrell’s C-index (1 minus concordance) |
Competing Risk | Harrell’s C-index (1 minus concordance) |
Multivariate Regression Multivariate Classification Multivariate Mixed Regression Multivariate Quantile Regression |
per response: same as above for Regression per response: same as above for Classification per response: same as above for Regression, Classification same as Multivariate Regression |
Unsupervised sidClustering Breiman (Shi-Horvath) |
none same as Multivariate Mixed Regression same as Classification |
library(randomForestSRC)
# New York air quality measurements. Mean ozone in parts per billion.
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
print(airq.obj)
> 1 Sample size: 111
> 2 Number of trees: 500
> 3 Forest terminal node size: 5
> 4 Average no. of terminal nodes: 14.68
> 5 No. of variables tried at each split: 2
> 6 Total no. of variables: 5
> 7 Resampling used to grow trees: swor
> 8 Resample size used to grow trees: 70
> 9 Analysis: RF-R
> 10 Family: regr
> 11 Splitting rule: mse *random*
> 12 Number of random split points: 10
> 13 (OOB) R squared: 0.70871819
> 14 (OOB) Requested performance error: 322.53346594
library(randomForestSRC)
# New York air quality measurements. Mean ozone in parts per billion.
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
print(airq.obj)
> 1 Sample size: 111
> 2 Number of trees: 500
> 3 Forest terminal node size: 5
> 4 Average no. of terminal nodes: 14.68
> 5 No. of variables tried at each split: 2
> 6 Total no. of variables: 5
> 7 Resampling used to grow trees: swor
> 8 Resample size used to grow trees: 70
> 9 Analysis: RF-R
> 10 Family: regr
> 11 Splitting rule: mse *random*
> 12 Number of random split points: 10
> 13 (OOB) R squared: 0.70871819
> 14 (OOB) Requested performance error: 322.53346594
library(randomForestSRC)
# New York air quality measurements. Mean ozone in parts per billion.
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
print(airq.obj)
> 1 Sample size: 111
> 2 Number of trees: 500
> 3 Forest terminal node size: 5
> 4 Average no. of terminal nodes: 14.68
> 5 No. of variables tried at each split: 2
> 6 Total no. of variables: 5
> 7 Resampling used to grow trees: swor
> 8 Resample size used to grow trees: 70
> 9 Analysis: RF-R
> 10 Family: regr
> 11 Splitting rule: mse *random*
> 12 Number of random split points: 10
> 13 (OOB) R squared: 0.70871819
> 14 (OOB) Requested performance error: 322.53346594
Default is number of variables divided by 3 for regression. For all other families (including unsupervised settings), it is the square root of number of variables. Values are rounded up
library(randomForestSRC)
# New York air quality measurements. Mean ozone in parts per billion.
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
print(airq.obj)
> 1 Sample size: 111
> 2 Number of trees: 500
> 3 Forest terminal node size: 5
> 4 Average no. of terminal nodes: 14.68
> 5 No. of variables tried at each split: 2
> 6 Total no. of variables: 5
> 7 Resampling used to grow trees: swor
> 8 Resample size used to grow trees: 70
> 9 Analysis: RF-R
> 10 Family: regr
> 11 Splitting rule: mse *random*
> 12 Number of random split points: 10
> 13 (OOB) R squared: 0.70871819
> 14 (OOB) Requested performance error: 322.53346594
line 8 displays the sample size for line 7 where for swor
, the number equals to about 63.2% observations, which matches the usual bootstrap sample size
library(randomForestSRC)
# New York air quality measurements. Mean ozone in parts per billion.
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
print(airq.obj)
> 1 Sample size: 111
> 2 Number of trees: 500
> 3 Forest terminal node size: 5
> 4 Average no. of terminal nodes: 14.68
> 5 No. of variables tried at each split: 2
> 6 Total no. of variables: 5
> 7 Resampling used to grow trees: swor
> 8 Resample size used to grow trees: 70
> 9 Analysis: RF-R
> 10 Family: regr
> 11 Splitting rule: mse *random*
> 12 Number of random split points: 10
> 13 (OOB) R squared: 0.70871819
> 14 (OOB) Requested performance error: 322.53346594
line 9 and 10 show the type of forest where RF-R
and regr
refer to regression
library(randomForestSRC)
# New York air quality measurements. Mean ozone in parts per billion.
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
print(airq.obj)
> 1 Sample size: 111
> 2 Number of trees: 500
> 3 Forest terminal node size: 5
> 4 Average no. of terminal nodes: 14.68
> 5 No. of variables tried at each split: 2
> 6 Total no. of variables: 5
> 7 Resampling used to grow trees: swor
> 8 Resample size used to grow trees: 70
> 9 Analysis: RF-R
> 10 Family: regr
> 11 Splitting rule: mse *random*
> 12 Number of random split points: 10
> 13 (OOB) R squared: 0.70871819
> 14 (OOB) Requested performance error: 322.53346594
library(randomForestSRC)
# New York air quality measurements. Mean ozone in parts per billion.
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
print(airq.obj)
> 1 Sample size: 111
> 2 Number of trees: 500
> 3 Forest terminal node size: 5
> 4 Average no. of terminal nodes: 14.68
> 5 No. of variables tried at each split: 2
> 6 Total no. of variables: 5
> 7 Resampling used to grow trees: swor
> 8 Resample size used to grow trees: 70
> 9 Analysis: RF-R
> 10 Family: regr
> 11 Splitting rule: mse *random*
> 12 Number of random split points: 10
> 13 (OOB) R squared: 0.70871819
> 14 (OOB) Requested performance error: 322.53346594
flowchart LR A[Grow a forest] --> B[Check the output] --> C[<h3 style="color:darkgreen">Make the prediction</h3>]
Data must be real or factors and in data.frame
format and you have to use the as.data.frame()
command to convert matrix
, tibble
, etc
Choose your variables in formula
and grow a tree
your outcome(s) will be saved in o$yvar
and your predictors are in o$xvar
from dta
without missing values
To impute your data, use the following commands
rfsrc.cart(formula, data, ntree = 1, mtry = ncol(data), bootstrap = “none”, …)
Tip: convenient interface for growing a Classification And Regression Tree (CART)
rfsrc.cart()
is useful for growing single trees of almost any type
Tip:
get.tree()
is useful for visualizing a single tree from a rfsrc.cart
or rfsrc
object specified by tree.id
Tip:
get.tree()
is useful for visualizing a single tree from a rfsrc.cart
or rfsrc
object specified by tree.id
Family | Example Grow Call with Formula Specification |
---|---|
Regression Quantile Regression |
rfsrc(Ozone~., data = airquality) quantreg(mpg~., data = mtcars)
|
Classification Imbalanced Two-Class |
rfsrc(Species~., data = iris) imbalanced(status~., data = breast)
|
Survival | rfsrc(Surv(time, status)~., data = veteran) |
Competing Risk | rfsrc(Surv(time, status)~., data = wihs) |
Multivariate Regression Multivariate Mixed Regression Multivariate Quantile Regression Multivariate Mixed Quantile Regression |
rfsrc(Multivar(mpg, cyl)~., data = mtcars) rfsrc(cbind(Species,Sepal.Length)~.,data=iris) quantreg(cbind(mpg, cyl)~., data = mtcars) quantreg(cbind(Species,Sepal.Length)~.,data=iris)
|
Unsupervised sidClustering Breiman (Shi-Horvath) |
rfsrc(data = mtcars) sidClustering(data = mtcars) sidClustering(data = mtcars, method = "sh")
|
The goal is to estimate \(F({\bf{x}})=E(Y|{\bf{X}}={\bf{x}})\) by minimizing the empirical risk functional \[ R_n(F) = \frac{1}{n}\sum_{i=1}^n (Y_i-F({\bf{x}}_i))^2 \] The tree accomplishes this piece-by-piece. For a node \(t\), the tree replaces the node estimator \({\overline{Y}}_t\) by daughter node estimators \({\overline{Y}}_{t_L}\) and \({\overline{Y}}_{t_R}\)
The empirical risk reduction from this split of \(t\) into its daughers \(t_L\) and \(t_R\) is \[ \frac{1}{n}\sum_{i\in t} (Y_i-{\overline{Y}}_t)^2 -\Bigg[\frac{1}{n}\sum_{i\in t_L} (Y_i-{\overline{Y}}_{t_L})^2 + \frac{1}{n}\sum_{i\in t_R} (Y_i-{\overline{Y}}_{t_R})^2\Bigg] \] Maximizing this decrease is equivalent to minimizing \[ \frac{1}{n}\sum_{i\in t_L} (Y_i-{\overline{Y}}_{t_L})^2 + \frac{1}{n}\sum_{i\in t_R} (Y_i-{\overline{Y}}_{t_R})^2 \] This leads to the weighted variance (mse) splitting used by CART and RF
Family | splitrule |
---|---|
Regression Quantile Regression |
mse la.quantile.regr, quantile.regr, mse
|
Classification Imbalanced Two-Class |
gini, auc, entropy gini, auc, entropy
|
Survival | logrank, bs.gradient, logrankscore |
Competing Risk | logrankCR, logrank |
Multivariate Regression Multivariate Classification Multivariate Mixed Regression Multivariate Quantile Regression Multivariate Mixed Quantile Regression |
mv.mse, mahalanobis mv.gini mv.mix mv.mse mv.mix
|
Unsupervised sidClustering Breiman (Shi-Horvath) |
unsupv \(\{\) mv.mse, mv.gini, mv.mix \(\}\), mahalanobis gini, auc, entropy
|
Let \(Y_1,\ldots,Y_n\) be the outcomes in \(t\), the mse split-statistic for a proposed split \(s\) is \[D(s,t)=\frac{1}{n}\sum_{i\in t_L}(Y_i-\overline{Y}_{t_L})^2 + \frac{1}{n}\sum_{i\in t_R}(Y_i-\overline{Y}_{t_R})^2\] where
are the left and right daughter nodes and
are the sample means for \(t_L\) and \(t_R\) respectively. The best split for \(X\) is the split-point \(s\) minimizing \(D(s,t)\) [4]
Let \(Y_1,\ldots,Y_n\) be the outcomes in \(t\), the mse split-statistic for a proposed split \(s\) is \[D(s,t)=\frac{1}{n}\sum_{i\in t_L}(Y_i-\overline{Y}_{t_L})^2 + \frac{1}{n}\sum_{i\in t_R}(Y_i-\overline{Y}_{t_R})^2\] where
are the left and right daughter nodes and
are the sample means for \(t_L\) and \(t_R\) respectively. The best split for \(X\) is the split-point \(s\) minimizing \(D(s,t)\) [4]
The key quantities returned by the package are
Data from the Ames Assessor’s Office used in assessing values of individual residential properties sold in Ames, Iowa from 2006 to 2010. This is a regression problem and the goal is to predict “SalePrice” which records the price of a home in thousands of dollars [5]
PID | MS.SubClass | MS.Zoning | Lot.Frontage | Lot.Area | Street | Alley | Lot.Shape | Land.Contour | Utilities | Lot.Config | Land.Slope | Neighborhood | Condition.1 | Condition.2 | Bldg.Type | House.Style | Overall.Qual | Overall.Cond | Year.Built | Year.Remod.Add | Roof.Style | Roof.Matl | Exterior.1st | Exterior.2nd | Mas.Vnr.Type | Mas.Vnr.Area | Exter.Qual | Exter.Cond | Foundation | Bsmt.Qual | Bsmt.Cond | Bsmt.Exposure | BsmtFin.Type.1 | BsmtFin.SF.1 | BsmtFin.Type.2 | BsmtFin.SF.2 | Bsmt.Unf.SF | Total.Bsmt.SF | Heating | Heating.QC | Central.Air | Electrical | X1st.Flr.SF | X2nd.Flr.SF | Low.Qual.Fin.SF | Gr.Liv.Area | Bsmt.Full.Bath | Bsmt.Half.Bath | Full.Bath | Half.Bath | Bedroom.AbvGr | Kitchen.AbvGr | Kitchen.Qual | TotRms.AbvGrd | Functional | Fireplaces | Fireplace.Qu | Garage.Type | Garage.Yr.Blt | Garage.Finish | Garage.Cars | Garage.Area | Garage.Qual | Garage.Cond | Paved.Drive | Wood.Deck.SF | Open.Porch.SF | Enclosed.Porch | X3Ssn.Porch | Screen.Porch | Pool.Area | Pool.QC | Fence | Misc.Feature | Misc.Val | Mo.Sold | Yr.Sold | Sale.Type | Sale.Condition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
526301100 | 20 | RL | 141 | 31770 | Pave | noAlleyAccess | IR1 | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 6 | 5 | 1960 | 1960 | Hip | CompShg | BrkFace | Plywood | Stone | 112 | TA | TA | CBlock | TA | Gd | Gd | BLQ | 639 | Unf | 0 | 441 | 1080 | GasA | Fa | Y | SBrkr | 1656 | 0 | 0 | 1656 | 1 | 0 | 1 | 0 | 3 | 1 | TA | 7 | Typ | 2 | Gd | Attchd | 1960 | Fin | 2 | 528 | TA | TA | P | 210 | 62 | 0 | 0 | 0 | 0 | noPool | noFence | none | 0 | 5 | 2010 | WD | Normal | 215000 |
526350040 | 20 | RH | 80 | 11622 | Pave | noAlleyAccess | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Feedr | Norm | 1Fam | 1Story | 5 | 6 | 1961 | 1961 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | CBlock | TA | TA | No | Rec | 468 | LwQ | 144 | 270 | 882 | GasA | TA | Y | SBrkr | 896 | 0 | 0 | 896 | 0 | 0 | 1 | 0 | 2 | 1 | TA | 5 | Typ | 0 | noFireplace | Attchd | 1961 | Unf | 1 | 730 | TA | TA | Y | 140 | 0 | 0 | 0 | 120 | 0 | noPool | MnPrv | none | 0 | 6 | 2010 | WD | Normal | 105000 |
526351010 | 20 | RL | 81 | 14267 | Pave | noAlleyAccess | IR1 | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 6 | 6 | 1958 | 1958 | Hip | CompShg | Wd Sdng | Wd Sdng | BrkFace | 108 | TA | TA | CBlock | TA | TA | No | ALQ | 923 | Unf | 0 | 406 | 1329 | GasA | TA | Y | SBrkr | 1329 | 0 | 0 | 1329 | 0 | 0 | 1 | 1 | 3 | 1 | Gd | 6 | Typ | 0 | noFireplace | Attchd | 1958 | Unf | 1 | 312 | TA | TA | Y | 393 | 36 | 0 | 0 | 0 | 0 | noPool | noFence | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 |
526353030 | 20 | RL | 93 | 11160 | Pave | noAlleyAccess | Reg | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 7 | 5 | 1968 | 1968 | Hip | CompShg | BrkFace | BrkFace | None | 0 | Gd | TA | CBlock | TA | TA | No | ALQ | 1065 | Unf | 0 | 1045 | 2110 | GasA | Ex | Y | SBrkr | 2110 | 0 | 0 | 2110 | 1 | 0 | 2 | 1 | 3 | 1 | Ex | 8 | Typ | 2 | TA | Attchd | 1968 | Fin | 2 | 522 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | noPool | noFence | none | 0 | 4 | 2010 | WD | Normal | 244000 |
527105010 | 60 | RL | 74 | 13830 | Pave | noAlleyAccess | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 5 | 5 | 1997 | 1998 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | PConc | Gd | TA | No | GLQ | 791 | Unf | 0 | 137 | 928 | GasA | Gd | Y | SBrkr | 928 | 701 | 0 | 1629 | 0 | 0 | 2 | 1 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1997 | Fin | 2 | 482 | TA | TA | Y | 212 | 34 | 0 | 0 | 0 | 0 | noPool | MnPrv | none | 0 | 3 | 2010 | WD | Normal | 189900 |
527105030 | 60 | RL | 78 | 9978 | Pave | noAlleyAccess | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 6 | 6 | 1998 | 1998 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 20 | TA | TA | PConc | TA | TA | No | GLQ | 602 | Unf | 0 | 324 | 926 | GasA | Ex | Y | SBrkr | 926 | 678 | 0 | 1604 | 0 | 0 | 2 | 1 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Attchd | 1998 | Fin | 2 | 470 | TA | TA | Y | 360 | 36 | 0 | 0 | 0 | 0 | noPool | noFence | none | 0 | 6 | 2010 | WD | Normal | 195500 |
The key quantities returned by the package are
Tip: omit or impute missing data
The default setting o <- rfsrc(…, na.action = “na.omit”) will provide complete cases in o$yvar and o$xvar
impute() is useful for imputing missing data
PID | MS.SubClass | MS.Zoning | Lot.Frontage | Lot.Area | Street | Alley | Lot.Shape | Land.Contour | Lot.Frontage.1 | Garage.Yr.Blt |
---|---|---|---|---|---|---|---|---|---|---|
527402200 | 20 | RL | NA | 11241 | Pave | noAlleyAccess | IR1 | Lvl | NA | 1970 |
527402250 | 20 | RL | NA | 12537 | Pave | noAlleyAccess | IR1 | Lvl | NA | 1971 |
528240070 | 60 | RL | NA | 7851 | Pave | noAlleyAccess | Reg | Lvl | NA | 2002 |
527425090 | 20 | RL | 70 | 10500 | Pave | noAlleyAccess | Reg | Lvl | 70 | NA |
534276360 | 20 | RL | 77 | 9320 | Pave | noAlleyAccess | IR1 | Lvl | 77 | NA |
909131125 | 190 | RH | NA | 7082 | Pave | noAlleyAccess | Reg | Lvl | NA | NA |
534427010 | 90 | RL | 98 | 13260 | Pave | noAlleyAccess | IR1 | Lvl | 98 | NA |
534450180 | 20 | RL | 50 | 7207 | Pave | noAlleyAccess | IR1 | Lvl | 50 | NA |
904351040 | 70 | C (all) | NA | 6449 | Pave | noAlleyAccess | IR1 | Lvl | NA | NA |
903200050 | 30 | RL | NA | 7446 | Pave | noAlleyAccess | Reg | Lvl | NA | NA |
Tip: omit or impute missing data
The default setting o <- rfsrc(…, na.action = “na.omit”) will provide complete cases in o$yvar and o$xvar
impute() is used for imputing missing data
PID | MS.SubClass | MS.Zoning | Lot.Frontage | Lot.Area | Street | Alley | Lot.Shape | Land.Contour | Lot.Frontage.1 | Garage.Yr.Blt |
---|---|---|---|---|---|---|---|---|---|---|
527402200 | 20 | RL | NA | 11241 | Pave | noAlleyAccess | IR1 | Lvl | NA | 1970 |
527402250 | 20 | RL | NA | 12537 | Pave | noAlleyAccess | IR1 | Lvl | NA | 1971 |
528240070 | 60 | RL | NA | 7851 | Pave | noAlleyAccess | Reg | Lvl | NA | 2002 |
527425090 | 20 | RL | 70 | 10500 | Pave | noAlleyAccess | Reg | Lvl | 70 | NA |
534276360 | 20 | RL | 77 | 9320 | Pave | noAlleyAccess | IR1 | Lvl | 77 | NA |
909131125 | 190 | RH | NA | 7082 | Pave | noAlleyAccess | Reg | Lvl | NA | NA |
534427010 | 90 | RL | 98 | 13260 | Pave | noAlleyAccess | IR1 | Lvl | 98 | NA |
534450180 | 20 | RL | 50 | 7207 | Pave | noAlleyAccess | IR1 | Lvl | 50 | NA |
904351040 | 70 | C (all) | NA | 6449 | Pave | noAlleyAccess | IR1 | Lvl | NA | NA |
903200050 | 30 | RL | NA | 7446 | Pave | noAlleyAccess | Reg | Lvl | NA | NA |
Tip: omit or impute missing data
The default setting o <- rfsrc(…, na.action = “na.omit”) will provide complete cases in o$yvar and o$xvar
impute() is used for imputing missing data
PID | MS.SubClass | MS.Zoning | Lot.Frontage | Lot.Area | Street | Alley | Lot.Shape | Land.Contour | Lot.Frontage.1 | Garage.Yr.Blt |
---|---|---|---|---|---|---|---|---|---|---|
527402200 | 20 | RL | NA | 11241 | Pave | noAlleyAccess | IR1 | Lvl | NA | 1970 |
527402250 | 20 | RL | NA | 12537 | Pave | noAlleyAccess | IR1 | Lvl | NA | 1971 |
528240070 | 60 | RL | NA | 7851 | Pave | noAlleyAccess | Reg | Lvl | NA | 2002 |
527425090 | 20 | RL | 70 | 10500 | Pave | noAlleyAccess | Reg | Lvl | 70 | NA |
534276360 | 20 | RL | 77 | 9320 | Pave | noAlleyAccess | IR1 | Lvl | 77 | NA |
909131125 | 190 | RH | NA | 7082 | Pave | noAlleyAccess | Reg | Lvl | NA | NA |
534427010 | 90 | RL | 98 | 13260 | Pave | noAlleyAccess | IR1 | Lvl | 98 | NA |
534450180 | 20 | RL | 50 | 7207 | Pave | noAlleyAccess | IR1 | Lvl | 50 | NA |
904351040 | 70 | C (all) | NA | 6449 | Pave | noAlleyAccess | IR1 | Lvl | NA | NA |
903200050 | 30 | RL | NA | 7446 | Pave | noAlleyAccess | Reg | Lvl | NA | NA |
nrow(housing)
> [1] 2930
# Use supervised OTF imputation
housing.im <- impute(SalePrice~., data = housing)
nrow(housing.im)
> [1] 2930
housing[c(12,15,23,24,25,56,28,120,2846,126,130,214,2671),c(1:9,4,60)]
> PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street Alley Lot.Shape Land.Contour Lot.Frontage.1 Garage.Yr.Blt
> 12 527165230 20 RL NA 7980 Pave noAlleyAccess IR1 Lvl NA 1992
> 15 527182190 120 RL NA 6820 Pave noAlleyAccess IR1 Lvl NA 1985
> 23 527368020 60 FV NA 7500 Pave noAlleyAccess Reg Lvl NA 2000
> 24 527402200 20 RL NA 11241 Pave noAlleyAccess IR1 Lvl NA 1970
> 25 527402250 20 RL NA 12537 Pave noAlleyAccess IR1 Lvl NA 1971
> 56 528240070 60 RL NA 7851 Pave noAlleyAccess Reg Lvl NA 2002
> 28 527425090 20 RL 70 10500 Pave noAlleyAccess Reg Lvl 70 NA
> 120 534276360 20 RL 77 9320 Pave noAlleyAccess IR1 Lvl 77 NA
> 2846 909131125 190 RH NA 7082 Pave noAlleyAccess Reg Lvl NA NA
> 126 534427010 90 RL 98 13260 Pave noAlleyAccess IR1 Lvl 98 NA
> 130 534450180 20 RL 50 7207 Pave noAlleyAccess IR1 Lvl 50 NA
> 214 904351040 70 C (all) NA 6449 Pave noAlleyAccess IR1 Lvl NA NA
> 2671 903200050 30 RL NA 7446 Pave noAlleyAccess Reg Lvl NA NA
nrow(housing)
> [1] 2930
# do quick and dirty imputation
housing.im <- impute(SalePrice~., data = housing)
nrow(housing.im)
> [1] 2930
housing.im[c(12,15,23,24,25,56,28,120,2846,126,130,214,2671),c(1:9,4,60)]
> PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street Alley Lot.Shape Land.Contour Lot.Frontage.1 Garage.Yr.Blt
> 12 527165230 20 RL 68.85565 7980 Pave noAlleyAccess IR1 Lvl 68.85565 1992.000
> 15 527182190 120 RL 68.56751 6820 Pave noAlleyAccess IR1 Lvl 68.56751 1985.000
> 23 527368020 60 FV 66.21796 7500 Pave noAlleyAccess Reg Lvl 66.21796 2000.000
> 24 527402200 20 RL 74.53002 11241 Pave noAlleyAccess IR1 Lvl 74.53002 1970.000
> 25 527402250 20 RL 80.12155 12537 Pave noAlleyAccess IR1 Lvl 80.12155 1971.000
> 56 528240070 60 RL 71.88750 7851 Pave noAlleyAccess Reg Lvl 71.88750 2002.000
> 28 527425090 20 RL 70.00000 10500 Pave noAlleyAccess Reg Lvl 70.00000 1965.035
> 120 534276360 20 RL 77.00000 9320 Pave noAlleyAccess IR1 Lvl 77.00000 1961.286
> 2846 909131125 190 RH 63.99655 7082 Pave noAlleyAccess Reg Lvl 63.99655 1946.405
> 126 534427010 90 RL 98.00000 13260 Pave noAlleyAccess IR1 Lvl 98.00000 1968.485
> 130 534450180 20 RL 50.00000 7207 Pave noAlleyAccess IR1 Lvl 50.00000 1962.214
> 214 904351040 70 C (all) 64.52222 6449 Pave noAlleyAccess IR1 Lvl 64.52222 1942.500
> 2671 903200050 30 RL 66.49275 7446 Pave noAlleyAccess Reg Lvl 66.49275 1952.198
Tip: model with missing data
rfsrc(…, na.action = “na.impute”) implements on-the-fly imputation
PID | MS.SubClass | MS.Zoning | Lot.Frontage | Lot.Area | Street | Alley | Lot.Shape | Land.Contour | Lot.Frontage.1 | Garage.Yr.Blt |
---|---|---|---|---|---|---|---|---|---|---|
527402200 | 20 | RL | NA | 11241 | Pave | noAlleyAccess | IR1 | Lvl | NA | 1970 |
527402250 | 20 | RL | NA | 12537 | Pave | noAlleyAccess | IR1 | Lvl | NA | 1971 |
528240070 | 60 | RL | NA | 7851 | Pave | noAlleyAccess | Reg | Lvl | NA | 2002 |
527425090 | 20 | RL | 70 | 10500 | Pave | noAlleyAccess | Reg | Lvl | 70 | NA |
534276360 | 20 | RL | 77 | 9320 | Pave | noAlleyAccess | IR1 | Lvl | 77 | NA |
909131125 | 190 | RH | NA | 7082 | Pave | noAlleyAccess | Reg | Lvl | NA | NA |
534427010 | 90 | RL | 98 | 13260 | Pave | noAlleyAccess | IR1 | Lvl | 98 | NA |
534450180 | 20 | RL | 50 | 7207 | Pave | noAlleyAccess | IR1 | Lvl | 50 | NA |
904351040 | 70 | C (all) | NA | 6449 | Pave | noAlleyAccess | IR1 | Lvl | NA | NA |
903200050 | 30 | RL | NA | 7446 | Pave | noAlleyAccess | Reg | Lvl | NA | NA |
o <- rfsrc(SalePrice~., data = housing)
o
Sample size: 2274
Number of trees: 500
Forest terminal node size: 5
Average no. of terminal nodes: 310.83
No. of variables tried at each split: 27
Total no. of variables: 80
Resampling used to grow trees: swor
Resample size used to grow trees: 1437
Analysis: RF-R
Family: regr
Splitting rule: mse *random*
Number of random split points: 10
(OOB) R squared: 0.90907743
(OOB) Requested performance error: 631482361.907825
o <- rfsrc(SalePrice~., data = housing, na.action = "na.impute")
o
Sample size: 2930
Was data imputed: yes
Number of trees: 500
Forest terminal node size: 5
Average no. of terminal nodes: 402.424
No. of variables tried at each split: 27
Total no. of variables: 80
Resampling used to grow trees: swor
Resample size used to grow trees: 1852
Analysis: RF-R
Family: regr
Splitting rule: mse *random*
Number of random split points: 10
(OOB) R squared: 0.90785994
(OOB) Requested performance error: 588027121.705363
Tip
Monotonic transformation on the outcome won’t substantially change the structure of random forest due to its robustness. However, it may be useful for scaling the mean squared error
o <- rfsrc(SalePrice~., data= housing)
o
> Sample size: 2274
> Number of trees: 500
> Forest terminal node size: 5
> Average no. of terminal nodes: 310.594
> No. of variables tried at each split: 27
> Total no. of variables: 80
> Resampling used to grow trees: swor
> Resample size used to grow trees: 1437
> Analysis: RF-R
> Family: regr
> Splitting rule: mse *random*
> Number of random split points: 10
> (OOB) R squared: 0.90969641
> (OOB) Requested performance error: 627183369.648676
> housing$SalePrice <- log(housing$SalePrice)
> o <- rfsrc(SalePrice~., data= housing)
> o
Sample size: 2274
Number of trees: 500
Forest terminal node size: 5
Average no. of terminal nodes: 310.61
No. of variables tried at each split: 27
Total no. of variables: 80
Resampling used to grow trees: swor
Resample size used to grow trees: 1437
Analysis: RF-R
Family: regr
Splitting rule: mse *random*
Number of random split points: 10
(OOB) R squared: 0.89837879
(OOB) Requested performance error: 0.01688687
Tune nodesize
rfsrc(
formula, data, ntree = 500, mtry = NULL, ytry = NULL, nodesize = NULL, …)
nodesize
sets the minumum size of terminal, default values are:
15 for survival
15 for competing risk
5 for regression
1 for classification
3 for mixed outcomes
3 for unsupervised
We can use tune.nodesize()
> nodesize = 1 error = 10.75%
> nodesize = 2 error = 11.36%
> nodesize = 3 error = 11.55%
> nodesize = 4 error = 11.3%
> nodesize = 5 error = 11.72%
> nodesize = 6 error = 12.2%
> nodesize = 7 error = 12.75%
> nodesize = 8 error = 12.54%
> nodesize = 9 error = 13.2%
> nodesize = 10 error = 13.98%
> nodesize = 15 error = 14.88%
> nodesize = 20 error = 17.14%
> nodesize = 25 error = 19.26%
> nodesize = 30 error = 20.92%
> nodesize = 35 error = 21.64%
> nodesize = 40 error = 23.86%
> nodesize = 45 error = 24.64%
> nodesize = 50 error = 25.4%
> nodesize = 55 error = 26.87%
> optimal nodesize: 1
Family | Example Grow Call with Formula Specification |
---|---|
Regression Quantile Regression |
rfsrc(Ozone~., data = airquality) quantreg(mpg~., data = mtcars) |
Classification Imbalanced Two-Class |
rfsrc(Species~., data = iris) imbalanced(status~., data = breast)
|
Survival | rfsrc(Surv(time, status)~., data = veteran) |
Competing Risk | rfsrc(Surv(time, status)~., data = wihs) |
Multivariate Regression Multivariate Mixed Regression Multivariate Quantile Regression Multivariate Mixed Quantile Regression |
rfsrc(Multivar(mpg, cyl)~., data = mtcars) rfsrc(cbind(Species,Sepal.Length)~.,data=iris) quantreg(cbind(mpg, cyl)~., data = mtcars) quantreg(cbind(Species,Sepal.Length)~.,data=iris)
|
Unsupervised sidClustering Breiman (Shi-Horvath) |
rfsrc(data = mtcars) sidClustering(data = mtcars) sidClustering(data = mtcars, method = "sh")
|
Quantile regression estimates the conditional distribution function (CDF) \[ F(y|\mathbf{ X}=\mathbf{ x})=P\{Y\le y|\mathbf{ X}=\mathbf{ x}\} \]
The empirical risk functional for a quantile \(0<\alpha<1\) is \[ R_n(F)=\frac{1}{n}\sum_{i=1}^n L_{\alpha}(Y_i, F(\mathbf{ x}_i)) \] where \(L_{\alpha}(Y, u)=\phi_{\alpha}(Y-u)\) is the loss function (pinball loss) and \[ \phi_{\alpha}(u)=u(\alpha-1_{\{u<0\}}) \] is the check-loss function
In regression, the tree node estimator is the sample average \[ E(Y|\mathbf{ X}\in t) = \overline{Y}_{t} \] In quantile regression it is the CDF \[ P\{Y\le y|\mathbf{ X}\in t\} = \frac{1}{n}\sum_{i\in t} 1_{\{Y_i\le y\}} \]
Family | splitrule |
---|---|
Regression Quantile Regression |
mse la.quantile.regr, quantile.regr,mse |
Classification Imbalanced Two-Class |
gini, auc, entropy gini, auc, entropy
|
Survival | logrank, bs.gradient, logrankscore |
Competing Risk | logrankCR, logrank |
Multivariate Regression Multivariate Classification Multivariate Mixed Regression Multivariate Quantile Regression Multivariate Mixed Quantile Regression |
mv.mse, mahalanobis mv.gini mv.mix mv.mse mv.mix
|
Unsupervised sidClustering Breiman (Shi-Horvath) |
unsupv \(\{\) mv.mse, mv.gini, mv.mix \(\}\), mahalanobis gini, auc, entropy
|
quantile.regr
is the quantile splitting rule using quantile loss (pinball loss). It requires specifying the target quantiles which by default are the percentiles 1%, 2%, \(\ldots\), 99%
la.quantile.regr
refers to local adaptive quantile regression splitting (the default action for quantreg
) under default percentiles 1%, 2%, \(\ldots\), 99%
The key quantities returned by the package are
o <- quantreg(SalePrice ~ ., housing, splitrule = "mse", ntree = 250)
o <- quantreg(SalePrice ~ ., housing, splitrule = "quantile.regr", ntree = 250)
o <- quantreg(SalePrice ~ ., housing, splitrule = "la.quantile.regr", ntree = 250) # (default)
> o
Sample size: 2274
Number of trees: 250
Forest terminal node size: 5
Average no. of terminal nodes: 263.86
No. of variables tried at each split: 27
Total no. of variables: 80
Resampling used to grow trees: swor
Resample size used to grow trees: 1437
Analysis: RF-R
Family: regr
Splitting rule: la.quantile.regr *random*
Number of random split points: 10
(OOB) R squared: 0.85752024
(OOB) Requested performance error: 0.02367652
The key quantities returned by the package are
o <- quantreg(SalePrice ~ ., housing)
names(o$quantreg)
> [1] "quantiles" "prob" "cdf" "density" "yunq"
head(o$quantreg$yunq)
> [1] 9.456341 9.480368 10.463103 10.471950 10.596635 10.714418
head(o$quantreg$prob)
> [1] 0.01 0.02 0.03 0.04 0.05 0.06
head(o$quantreg$cdf[ ,1:5])
> [,1] [,2] [,3] [,4] [,5]
> [1,] 0 0 0.0008795075 0.0008795075 0.0008795075
> [2,] 0 0 0.0008795075 0.0008795075 0.0008795075
> [3,] 0 0 0.0008795075 0.0008795075 0.0008795075
> [4,] 0 0 0.0008795075 0.0008795075 0.0008795075
> [5,] 0 0 0.0008795075 0.0008795075 0.0008795075
> [6,] 0 0 0.0008795075 0.0008795075 0.0008795075
head(o$quantreg$density[ ,1:5])
> [,1] [,2] [,3] [,4] [,5]
> [1,] 0 0 0.0008795075 0 0
> [2,] 0 0 0.0008795075 0 0
> [3,] 0 0 0.0008795075 0 0
> [4,] 0 0 0.0008795075 0 0
> [5,] 0 0 0.0008795075 0 0
> [6,] 0 0 0.0008795075 0 0
head(o$quantreg$quantiles[ ,1:5])
> [,1] [,2] [,3] [,4] [,5]
> [1,] 11.53273 11.68658 11.74543 11.77913 11.80560
> [2,] 11.24505 11.40199 11.46163 11.49883 11.52288
> [3,] 11.42954 11.58058 11.64151 11.67844 11.70355
> [4,] 11.81303 11.96285 12.01974 12.05815 12.08108
> [5,] 11.64395 11.79434 11.85296 11.88793 11.91505
> [6,] 11.68856 11.83501 11.89478 11.93164 11.95761
## (A) extract 25,50,75 quantiles
quant.dat <- get.quantile(o, c(.25, .50, .75))
head(quant.dat)
> q.25 q.50 q.75
> [1,] 12.01673 12.07254 12.13081
> [2,] 11.62536 11.67844 11.73767
> [3,] 11.89341 11.94795 12.00457
> [4,] 12.35017 12.40492 12.46071
> [5,] 12.04614 12.10071 12.15767
> [6,] 12.08954 12.14420 12.20106
## (B) values expected under normality
quant.stat <- get.quantile.stat(o)
c.mean <- quant.stat$mean
c.std <- quant.stat$std
q.25.est <- c.mean + qnorm(.25) * c.std
q.75.est <- c.mean + qnorm(.75) * c.std
## compare (A) and (B)
print(head(data.frame(quant.dat[, -2], q.25.est, q.75.est)))
> q.25 q.75 q.25.est q.75.est
> 1 12.01673 12.13081 11.98342 12.15846
> 2 11.62536 11.73767 11.59368 11.76268
> 3 11.89341 12.00457 11.85888 12.03187
> 4 12.35017 12.46071 12.31451 12.49045
> 5 12.04614 12.15767 12.01076 12.18538
> 6 12.08954 12.20106 12.05462 12.22996
Runs rfsrc
on a user’s specified data and performs a detailed analysis including tuning the forests and plotting various quantities such as performance metrics and variable importance. Applies to regression, classification, survival, competing risk and multivariate families
Family | Example Grow Call with Formula Specification |
---|---|
Regression Quantile Regression |
rfsrc(Ozone~., data = airquality) quantreg(mpg~., data = mtcars)
|
Classification Imbalanced Two-Class |
rfsrc(Species~., data = iris) imbalanced(status~., data = breast)
|
Survival | rfsrc(Surv(time, status)~., data = veteran) |
Competing Risk | rfsrc(Surv(time, status)~., data = wihs) |
Multivariate Regression Multivariate Mixed Regression Multivariate Quantile Regression Multivariate Mixed Quantile Regression |
rfsrc(Multivar(mpg, cyl)~., data = mtcars) rfsrc(cbind(Species,Sepal.Length)~.,data=iris) quantreg(cbind(mpg, cyl)~., data = mtcars) quantreg(cbind(Species,Sepal.Length)~.,data=iris)
|
Unsupervised sidClustering Breiman (Shi-Horvath) |
rfsrc(data = mtcars) sidClustering(data = mtcars) sidClustering(data = mtcars, method = "sh")
|
The classification-regression goal is to estimate \[ F_j(\mathbf{x})=P\{Y=j|\mathbf{X}=\mathbf{x}\} \hskip10pt \text{for } j=1,\ldots,J \] by minimizing the empirical risk functional, \(F=(F_1,\ldots,F_J)\), \[ R_n(F)=\frac{1}{n}\sum_{i=1}^n L(Y_i, F(\mathbf{x}_i)) \] using loss functions like Gini impurity, entropy and the Brier score
The classification goal is to classify cases, typically performed using the Bayes rule (maximal probability) \[ \delta(\mathbf{x}) = j, \hskip10pt \text{where } F_j(\mathbf{x})\ge F_{j'},(\mathbf{x})\hskip10pt j'=1,\ldots,J \]
The tree node estimator equals the relative frequencies \[ P\{Y=j|\mathbf{X}\in t\} = \dfrac{\text{number of cases in $t$ that are class label $j$}}{\text{number of cases in $t$}} \]
The key quantities returned by the package are
Subset of the data used in Ceccarelli et al. (2016) for molecular profiling of adult diffuse gliomas. It contains 1206 probes from 880 tissues, clinical data and other molecular data. The outcome has class labels: Classic-like, Codel, G-CIMP-high, G-CIMP-low, LGm6-GBM, Mesenchymal-like and PA-like [6]
y | cg23970089 | cg25176823 | cg24576735 | Chr19.20co.gain | TERT.promoter.status | SNP6 | U133a | grade | age | gender |
---|---|---|---|---|---|---|---|---|---|---|
Mesenchymal-like | 0.36518600 | 0.3244530 | 0.03255164 | No chr 19/20 gain | Mutant | Yes | Yes | G4 | 44 | female |
Mesenchymal-like | 0.60246131 | 0.3626941 | 0.02215452 | No chr 19/20 gain | Mutant | Yes | Yes | G4 | 50 | male |
Mesenchymal-like | 0.44011951 | 0.3679728 | 0.03369096 | No chr 19/20 gain | Mutant | Yes | No | G4 | 56 | female |
Classic-like | 0.06834791 | 0.3974054 | 0.03623296 | No chr 19/20 gain | Mutant | Yes | Yes | G4 | 40 | female |
Classic-like | 0.19843219 | 0.2736760 | 0.04060163 | No chr 19/20 gain | Mutant | Yes | Yes | G4 | 61 | female |
o <- rfsrc(y~., data = glioma)
o
Sample size: 878
Frequency of class labels: 148, 174, 249, 25, 41, 215, 26
Number of trees: 500
Forest terminal node size: 1
Average no. of terminal nodes: 74.302
No. of variables tried at each split: 36
Total no. of variables: 1241
Resampling used to grow trees: swor
Resample size used to grow trees: 555
Analysis: RF-C
Family: class
Splitting rule: gini *random*
Number of random split points: 10
(OOB) Brier score: 0.02333115
(OOB) Normalized Brier score: 0.19053769
(OOB) AUC: 0.99304797
(OOB) Log-loss: 0.34115074
(OOB) Requested performance error: 0.07061503, 0.06756757, 0.01149425, 0.00803213, 0.52, 0.53658537, 0.04651163, 0.11538462
Confusion matrix:
predicted
observed Classic-like Codel G-CIMP-high G-CIMP-low LGm6-GBM Mesenchymal-like PA-like class.error
Classic-like 138 0 0 0 0 10 0 0.0676
Codel 0 172 2 0 0 0 0 0.0115
G-CIMP-high 0 2 247 0 0 0 0 0.0080
G-CIMP-low 0 0 13 12 0 0 0 0.5200
LGm6-GBM 0 0 1 0 19 21 0 0.5366
Mesenchymal-like 11 0 0 0 0 204 0 0.0512
PA-like 0 0 0 0 0 3 23 0.1154
(OOB) Misclassification rate: 0.07175399
Tip
rfsrc
recognize classification problem automatically when the outcome is a factor
The key quantities returned by the package are
o$predicted ---> inbag estimated probabilities
o$predicted.oob ---> OOB estimated probabilities
o$class ---> inbag class predictions
o$class.oob ---> OOB class predictions
o$predicted[1:5,] # ---> inbag estimated probabilities
> Classic-like Codel G-CIMP-high G-CIMP-low LGm6-GBM Mesenchymal-like PA-like
> [1,] 0.006 0.002 0.000 0.004 0.084 0.892 0.012
> [2,] 0.136 0.008 0.000 0.004 0.014 0.832 0.006
> [3,] 0.064 0.004 0.008 0.010 0.018 0.892 0.004
> [4,] 0.928 0.008 0.004 0.004 0.002 0.054 0.000
> [5,] 0.936 0.000 0.000 0.002 0.000 0.062 0.000
o$predicted.oob[1:5,] # ---> OOB estimated probabilities
> Classic-like Codel G-CIMP-high G-CIMP-low LGm6-GBM Mesenchymal-like PA-like
> [1,] 0.01840491 0.006134969 0.00000000 0.012269939 0.257668712 0.6687117 0.03680982
> [2,] 0.38857143 0.022857143 0.00000000 0.011428571 0.040000000 0.5200000 0.01714286
> [3,] 0.17391304 0.010869565 0.02173913 0.027173913 0.048913043 0.7065217 0.01086957
> [4,] 0.81443299 0.020618557 0.01030928 0.010309278 0.005154639 0.1391753 0.00000000
> [5,] 0.82222222 0.000000000 0.00000000 0.005555556 0.000000000 0.1722222 0.00000000
o$class[1:5] # ---> inbag class predictions
> [1] Mesenchymal-like Mesenchymal-like Mesenchymal-like Classic-like Classic-like
o$class.oob[1:5] # ---> OOB class predictions
> [1] Mesenchymal-like Mesenchymal-like Mesenchymal-like Classic-like Classic-like
Family | splitrule |
---|---|
Regression Quantile Regression |
mse la.quantile.regr, quantile.regr, mse
|
Classification Imbalanced Two-Class |
gini, auc, entropy gini, auc, entropy
|
Survival | logrank, bs.gradient, logrankscore |
Competing Risk | logrankCR, logrank |
Multivariate Regression Multivariate Classification Multivariate Mixed Regression Multivariate Quantile Regression Multivariate Mixed Quantile Regression |
mv.mse, mahalanobis mv.gini mv.mix mv.mse mv.mix
|
Unsupervised sidClustering Breiman (Shi-Horvath) |
unsupv \(\{\) mv.mse, mv.gini, mv.mix \(\}\), mahalanobis gini, auc, entropy
|
Learn more [7]: https://www.randomforestsrc.org/articles/aucsplit.html
o <- rfsrc(y~., data = glioma,
splitrule="gini") ## default splitrule as in the previous slide
o
Sample size: 878
Frequency of class labels: 148, 174, 249, 25, 41, 215, 26
Number of trees: 500
Forest terminal node size: 1
Average no. of terminal nodes: 74.302
No. of variables tried at each split: 36
Total no. of variables: 1241
Resampling used to grow trees: swor
Resample size used to grow trees: 555
Analysis: RF-C
Family: class
Splitting rule: gini *random*
Number of random split points: 10
(OOB) Brier score: 0.02333115
(OOB) Normalized Brier score: 0.19053769
(OOB) AUC: 0.99304797
(OOB) Log-loss: 0.34115074
(OOB) Requested performance error: 0.07061503, 0.06756757, 0.01149425, 0.00803213, 0.52, 0.53658537, 0.04651163, 0.11538462
See Chapter 4.3 in Breiman et al. 1984 [4]
o <- rfsrc(y~., data = glioma,
splitrule="auc")
o
Sample size: 878
Frequency of class labels: 148, 174, 249, 25, 41, 215, 26
Number of trees: 500
Forest terminal node size: 1
Average no. of terminal nodes: 434.522
No. of variables tried at each split: 36
Total no. of variables: 1241
Resampling used to grow trees: swor
Resample size used to grow trees: 555
Analysis: RF-C
Family: class
Splitting rule: auc *random*
Number of random split points: 10
(OOB) Brier score: 0.04799401
(OOB) Normalized Brier score: 0.39195105
(OOB) AUC: 0.97510083
(OOB) Log-loss: 0.62743109
(OOB) Requested performance error: 0.16514806, 0.25675676, 0.09770115, 0.01606426, 0.84, 0.97560976, 0.04651163, 0.57692308
Tip
AUC splitting is appropriate for imbalanced data
o <- rfsrc(y~., data = glioma,
splitrule="entropy")
o
Sample size: 878
Frequency of class labels: 148, 174, 249, 25, 41, 215, 26
Number of trees: 500
Forest terminal node size: 1
Average no. of terminal nodes: 287.266
No. of variables tried at each split: 36
Total no. of variables: 1241
Resampling used to grow trees: swor
Resample size used to grow trees: 555
Analysis: RF-C
Family: class
Splitting rule: entropy *random*
Number of random split points: 10
(OOB) Brier score: 0.04504941
(OOB) Normalized Brier score: 0.36790354
(OOB) AUC: 0.96636193
(OOB) Log-loss: 0.64571496
(OOB) Requested performance error: 0.15148064, 0.16216216, 0.04022989, 0.01204819, 0.72, 1, 0.10232558, 0.69230769
See Chapters 2.5 and 4.3 in Breiman et al. 1984 [4]
Family | Example Grow Call with Formula Specification |
---|---|
Regression Quantile Regression |
rfsrc(Ozone~., data = airquality) quantreg(mpg~., data = mtcars)
|
Classification Imbalanced Two-Class |
rfsrc(Species~., data = iris) imbalanced(status~., data = breast)
|
Survival | rfsrc(Surv(time, status)~., data = veteran) |
Competing Risk | rfsrc(Surv(time, status)~., data = wihs) |
Multivariate Regression Multivariate Mixed Regression Multivariate Quantile Regression Multivariate Mixed Quantile Regression |
rfsrc(Multivar(mpg, cyl)~., data = mtcars) rfsrc(cbind(Species,Sepal.Length)~.,data=iris) quantreg(cbind(mpg, cyl)~., data = mtcars) quantreg(cbind(Species,Sepal.Length)~.,data=iris)
|
Unsupervised sidClustering Breiman (Shi-Horvath) |
rfsrc(data = mtcars) sidClustering(data = mtcars) sidClustering(data = mtcars, method = "sh")
|
Learn more [8]: https://www.randomforestsrc.org/articles/survival.html
In survival one of the primary goals is estimating the survival function \[ S(t|\mathbf{X}=\mathbf{x}) = P\{T^o> t|\mathbf{X}=\mathbf{x}\} \] where \(T^o\) is the survival time
Random survival forests [9] uses the KM (Kaplan-Meier) estimator with the general estimation strategy (product limit estimator) \[ S(t) = S(t_1) \times S(t_2|t_1) \times \cdots \times S(t_M|t_{M-1}) \] with \[ S(t_j|t_{j-1}) = \frac{\text{number surviving beyond $t=t_j$}}{\text{number surviving beyond $t>t_{j-1}$}}. \]
This deals with generalized Type-I censoring where patients enter a study at different times and length of follow-up time varies for each patient and where time of event is unobserved for some patients due to the study ending
The tree node estimator is the KM estimator. Tree nodes are split using the log-rank statistic
The key quantities returned by the package are
o$time.interest ---> event times (everything keys off this)
o$predicted ---> inbag estimated mortality
o$predicted.oob ---> OOB estimated mortality
o$survival ---> inbag survival estimator for each case
o$survival.oob ---> OOB survival estimator for each case
o$chf ---> inbag CHF estimator for each case
o$chf.oob ---> OOB CHF estimator for each case
This data is from the Mayo Clinic trial in Primary Biliary Cirrhosis (PBC) conducted between 1974 and 1984. Among the total 424 PBC patients, the first 312 cases participated in a randomized placebo controlled trial of the drug D-penicillamine with largely complete data. The additional 112 cases did not participate in the clinical trial, but consented to have basic measurements and to be followed for survival [10]
id | time | status | trt | age | sex | ascites | hepato | spiders | edema | bili | chol | albumin | copper | alk.phos | ast | trig | platelet | protime | stage |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 400 | 2 | 1 | 58.76523 | f | 1 | 1 | 1 | 1.0 | 14.5 | 261 | 2.60 | 156 | 1718.0 | 137.95 | 172 | 190 | 12.2 | 4 |
2 | 4500 | 0 | 1 | 56.44627 | f | 0 | 1 | 1 | 0.0 | 1.1 | 302 | 4.14 | 54 | 7394.8 | 113.52 | 88 | 221 | 10.6 | 3 |
3 | 1012 | 2 | 1 | 70.07255 | m | 0 | 0 | 0 | 0.5 | 1.4 | 176 | 3.48 | 210 | 516.0 | 96.10 | 55 | 151 | 12.0 | 4 |
4 | 1925 | 2 | 1 | 54.74059 | f | 0 | 1 | 1 | 0.5 | 1.8 | 244 | 2.54 | 64 | 6121.8 | 60.63 | 92 | 183 | 10.3 | 4 |
5 | 1504 | 1 | 2 | 38.10541 | f | 0 | 1 | 1 | 0.0 | 3.4 | 279 | 3.53 | 143 | 671.0 | 113.15 | 72 | 136 | 10.9 | 3 |
6 | 2503 | 2 | 2 | 66.25873 | f | 0 | 1 | 0 | 0.0 | 0.8 | 248 | 3.98 | 50 | 944.0 | 93.00 | 63 | NA | 11.0 | 3 |
pbc$id <- NULL ## remove the ID
## keep the original competing risk framework for later
## status at endpoint, 0/1/2 for censored, transplant, dead
pbc.cr <- pbc
## convert to right-censoring with death as the event
pbc$status[pbc$status > 0] <- 1
o <- rfsrc(Surv(time, status) ~ ., data = pbc)
o
Sample size: 276
Number of deaths: 129
Number of trees: 500
Forest terminal node size: 15
Average no. of terminal nodes: 14.326
No. of variables tried at each split: 5
Total no. of variables: 17
Resampling used to grow trees: swor
Resample size used to grow trees: 174
Analysis: RSF
Family: surv
Splitting rule: logrank *random*
Number of random split points: 10
(OOB) CRPS: 553.15107271
(OOB) stand. CRPS: 0.13198546
(OOB) Requested performance error: 0.18814768
o$survival ---> inbag survival estimator for each case
o$survival.oob ---> OOB survival estimator for each case
dim(o$survival) # ---> inbag survival estimator for each case
> [1] 276 127
dim(o$survival.oob) # ---> OOB survival estimator for each case
> [1] 276 127
o$survival[1:5, 1:5]
> [,1] [,2] [,3] [,4] [,5]
> [1,] 0.9600951 0.9125329 0.9002843 0.8725504 0.8423145
> [2,] 1.0000000 0.9999167 0.9997543 0.9996710 0.9995136
> [3,] 0.9959351 0.9921702 0.9877700 0.9782302 0.9768478
> [4,] 0.9954334 0.9906795 0.9871662 0.9857532 0.9841598
> [5,] 1.0000000 0.9998401 0.9967281 0.9947621 0.9946241
o$survival.oob[1:5, 1:5]
> [,1] [,2] [,3] [,4] [,5]
> [1,] 0.9655403 0.9160407 0.9050698 0.8755058 0.8428749
> [2,] 1.0000000 1.0000000 0.9995281 0.9995281 0.9995281
> [3,] 0.9957842 0.9926934 0.9881592 0.9768084 0.9752430
> [4,] 0.9942668 0.9873056 0.9850606 0.9838238 0.9819504
> [5,] 1.0000000 1.0000000 0.9962979 0.9942830 0.9939347
dim(o$chf)
> [1] 276 127
dim(o$chf.oob)
> [1] 276 127
o$chf[1:5, 1:5]
> [,1] [,2] [,3] [,4] [,5]
> [1,] 0.039904941 8.981063e-02 0.103197429 0.1338071668 0.1686493781
> [2,] 0.000000000 8.333333e-05 0.000245671 0.0003326275 0.0004976107
> [3,] 0.004064919 7.977572e-03 0.012491370 0.0227012085 0.0244150195
> [4,] 0.004566626 9.643876e-03 0.013209059 0.0146954141 0.0163487486
> [5,] 0.000000000 1.598746e-04 0.003271896 0.0052661647 0.0054040957
o$chf.oob[1:5, 1:5]
> [,1] [,2] [,3] [,4] [,5]
> [1,] 0.034459694 0.085894243 0.0978252512 0.1301985027 0.1682733452
> [2,] 0.000000000 0.000000000 0.0004719118 0.0004719118 0.0004719118
> [3,] 0.004215783 0.007519603 0.0122166452 0.0244321624 0.0264688132
> [4,] 0.005733225 0.013259881 0.0155376855 0.0168153866 0.0187680623
> [5,] 0.000000000 0.000000000 0.0037021133 0.0057619372 0.0061102479
Connection between mortality and the CHF
Mortality for a case i equals the expected number of deaths if all patients were the same as case i. Mortality is obtained by summing the CHF
Family | splitrule |
---|---|
Regression Quantile Regression |
mse la.quantile.regr, quantile.regr, mse
|
Classification Imbalanced Two-Class |
gini, auc, entropy gini, auc, entropy |
Survival | logrank, bs.gradient, logrankscore |
Competing Risk | logrankCR, logrank |
Multivariate Regression Multivariate Classification Multivariate Mixed Regression Multivariate Quantile Regression Multivariate Mixed Quantile Regression |
mv.mse, mahalanobis mv.gini mv.mix mv.mse mv.mix
|
Unsupervised sidClustering Breiman (Shi-Horvath) |
unsupv \(\{\) mv.mse, mv.gini, mv.mix \(\}\), mahalanobis gini, auc, entropy
|
o <- rfsrc(Surv(time, status) ~ ., data = pbc,
splitrule="logrank") ## default splitrule
o
Sample size: 276
Number of deaths: 129
Number of trees: 500
Forest terminal node size: 15
Average no. of terminal nodes: 14.326
No. of variables tried at each split: 5
Total no. of variables: 17
Resampling used to grow trees: swor
Resample size used to grow trees: 174
Analysis: RSF
Family: surv
Splitting rule: logrank *random*
Number of random split points: 10
(OOB) CRPS: 553.15107271
(OOB) stand. CRPS: 0.13198546
(OOB) Requested performance error: 0.18814768
o <- rfsrc(Surv(time, status) ~ ., data = pbc,
splitrule="bs.gradient")
o
Sample size: 276
Number of deaths: 129
Number of trees: 500
Forest terminal node size: 15
Average no. of terminal nodes: 13.226
No. of variables tried at each split: 5
Total no. of variables: 17
Resampling used to grow trees: swor
Resample size used to grow trees: 174
Analysis: RSF
Family: surv
Splitting rule: bs.gradient *random*
Number of random split points: 10
(OOB) CRPS: 570.11311519
(OOB) stand. CRPS: 0.13603272
(OOB) Requested performance error: 0.19299269
Tip
The time horizon used for the Brier score is set to the 90th percentile of the observed event times. This can be over-ridden by the option prob, which must be a value between 0 and 1 (set to .90 by default)
o <- rfsrc(Surv(time, status) ~ ., data = pbc,
splitrule="logrankscore")
o
Sample size: 276
Number of deaths: 129
Number of trees: 500
Forest terminal node size: 15
Average no. of terminal nodes: 13.46
No. of variables tried at each split: 5
Total no. of variables: 17
Resampling used to grow trees: swor
Resample size used to grow trees: 174
Analysis: RSF
Family: surv
Splitting rule: logrankscore *random*
Number of random split points: 10
(OOB) CRPS: 626.32256755
(OOB) stand. CRPS: 0.14944466
(OOB) Requested performance error: 0.19231977