Tree-Based Machine Learning Methods for Prediction, Variable Selection and Causal Inference

Part I

Hemant Ishwaran and Min Lu
University of Miami

Brief Overview

This course covers the following R packages

  1. randomForestSRC: Fast Unified Random Forests for Survival, Regression, and Classification (CRAN link)
  2. randomForestSRC.run: Integrated visualization for the randomForestSRC package (GitHub link)
  3. varPro: Model-Independent Variable Selection via the Rule-Based Variable Priority (GitHub link)

Outline

Part I: Training

  1. Quick start
  2. Data structures allowed
  3. Training (grow) with examples
    (regression, classification, survival)

Part II: Inference and Prediction

  1. Inference (OOB)
  2. Prediction Error
  3. Prediction
  4. Restore
  5. Partial Plots

Part III: Variable Selection

  1. VIMP
  2. Subsampling (Confidence Intervals)
  3. Minimal Depth
  4. VarPro

Part IV: Advanced Examples

  1. Class Imbalanced Data
  2. Competing Risks
  3. Multivariate Forests
  4. Missing data imputation

Brief Overview

randomForestSRC is a fast, memory-efficient OpenMP package for fitting random forests (RF) for univariate, multivariate, unsupervised, survival, competing risks, class imbalanced classification, and quantile regression families

Brief Overview

A random forest (RF) [1] consists of a collection of random tree structured predictors {\(h({\bf x},{{\bf\Theta}_b}), b= 1,2,\ldots, B\)} where {\({{\bf\Theta}_b}\)} are independent identically distributed random vectors. The RF predictor is the ensemble \[F_{{\!\text{ens}}}({\bf x})=\frac{1}{B}\sum_{b=1}^B h({\bf x},{{\bf\Theta}_b})\]

Typically, \({{\bf\Theta}_b}\) encode the instructions for bootstrapping the data and for random feature selection used for tree splitting
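The ensemble formula above can be sketched in a few lines of base R. This is an illustrative toy, not the randomForestSRC implementation: each \({\bf\Theta}_b\) is a bootstrap draw, each "tree" is reduced to a single stump, and the names `stump`, `predict_stump`, and `F_ens` are invented for this sketch.

```r
## Toy bagging ensemble (base R only; assumption-laden sketch, not package code)
set.seed(1)
n <- 200; B <- 50
x <- runif(n)
y <- ifelse(x > 0.5, 2, 0) + rnorm(n, sd = 0.3)

stump <- function(x, y, s) {              # "tree": left/right means around split s
  list(s = s, left = mean(y[x <= s]), right = mean(y[x > s]))
}
predict_stump <- function(fit, x0) ifelse(x0 <= fit$s, fit$left, fit$right)

fits <- lapply(seq_len(B), function(b) {
  idx <- sample.int(n, replace = TRUE)    # Theta_b: the bootstrap indices
  stump(x[idx], y[idx], s = median(x[idx]))
})
F_ens <- function(x0) rowMeans(sapply(fits, predict_stump, x0 = x0))
F_ens(c(0.1, 0.9))                        # ensemble averages the B stump predictions
```

Averaging over the bootstrap draws is what stabilizes the individual high-variance trees.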

Brief Overview

rfsrc(formula, data, ntree, mtry, nodesize)

Grow (train) your RF through rfsrc
Specify your model in formula
Provide your data frame in data
Tune your model via ntree, mtry, nodesize

Brief Overview

rfsrc(formula, data, ntree, mtry, nodesize)

Family                                   Example Grow Call with Formula Specification
Survival                                 rfsrc(Surv(time, status)~., data = veteran)
Competing Risk                           rfsrc(Surv(time, status)~., data = wihs)
Regression                               rfsrc(Ozone~., data = airquality)
Quantile Regression                      quantreg(mpg~., data = mtcars)
Classification                           rfsrc(Species~., data = iris)
Imbalanced Two-Class                     imbalanced(status~., data = breast)
Multivariate Regression                  rfsrc(Multivar(mpg, cyl)~., data = mtcars)
Multivariate Mixed Regression            rfsrc(cbind(Species, Sepal.Length)~., data = iris)
Multivariate Quantile Regression         quantreg(cbind(mpg, cyl)~., data = mtcars)
Multivariate Mixed Quantile Regression   quantreg(cbind(Species, Sepal.Length)~., data = iris)
Unsupervised                             rfsrc(data = mtcars)
sidClustering                            sidClustering(data = mtcars)
Breiman (Shi-Horvath)                    sidClustering(data = mtcars, method = "sh")

Quick Start

# Run this if you haven’t installed the R package
install.packages("randomForestSRC", repos = "https://cran.us.r-project.org") 

Quick Start

flowchart LR
  A[Grow a forest] --> B[Check the output] --> C[Make the prediction]

Quick Start

flowchart LR
  A[<h3 style="color:darkgreen">Grow a forest</h3>] --> B[Check the output] --> C[Make the prediction]

library(randomForestSRC)
# New York air quality measurements. Mean ozone in parts per billion.
airq.obj <- rfsrc(Ozone ~ ., data = airquality)

Quick Start

flowchart LR
  A[Grow a forest] --> B[<h3 style="color:darkgreen">Check the output</h3>] --> C[Make the prediction]

library(randomForestSRC)
# New York air quality measurements. Mean ozone in parts per billion.
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
print(airq.obj)


>  1                           Sample size: 111
>  2                       Number of trees: 500
>  3             Forest terminal node size: 5
>  4         Average no. of terminal nodes: 14.68
>  5  No. of variables tried at each split: 2
>  6                Total no. of variables: 5
>  7         Resampling used to grow trees: swor
>  8      Resample size used to grow trees: 70
>  9                              Analysis: RF-R
>  10                               Family: regr
>  11                       Splitting rule: mse *random*
>  12        Number of random split points: 10
>  13                      (OOB) R squared: 0.70871819
>  14    (OOB) Requested performance error: 322.53346594

Quick Start

Tip

The last line(s) show the cross-validated prediction performance

>  13                      (OOB) R squared: 0.70871819
>  14    (OOB) Requested performance error: 322.53346594
  • Model performance is displayed in the last two lines in terms of out-of-bag (OOB) prediction error. In regression, this is the cross-validated (leave-one-out bootstrap) mean squared error (MSE), estimated from the out-of-bag data and shown in line 14
  • Since MSE is scale dependent, the standardized MSE, defined as the MSE divided by the variance of the outcome, is converted to R squared (percent variance explained), which has an intuitive and universal interpretation, shown in line 13
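The standardized-MSE to R-squared conversion described above can be checked in base R. The data below are synthetic (not the airquality fit), and `yhat` is just a stand-in for OOB predictions.

```r
## R squared as 1 minus standardized MSE (base R sketch, synthetic data)
set.seed(7)
y    <- rnorm(100, mean = 50, sd = 10)
yhat <- y + rnorm(100, sd = 5)              # stand-in for OOB predictions
mse  <- mean((y - yhat)^2)
rsq  <- 1 - mse / mean((y - mean(y))^2)     # MSE divided by outcome variance
c(mse = mse, rsq = rsq)
```

Because R squared is scale-free, it is comparable across outcomes measured in different units, unlike the raw MSE.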

What is OOB prediction error?

  • Since each tree is grown from a random subset (bootstrap) of the data, the package returns both out-of-sample and in-sample predicted values from the forest, where the former are calculated using the held-out data for each tree ($predicted.oob), and the latter are from the bootstrap data used to train the tree ($predicted)
  • Only $predicted.oob should be used for inference on the training data since these values are cross-validated and will not over-fit the data.
>  13                      (OOB) R squared: 0.70871819
>  14    (OOB) Requested performance error: 322.53346594

We can obtain the error rate (mean-squared error) directly from the OOB ensemble by comparing the response to the OOB predictor below

print(mean((airq.obj$yvar - airq.obj$predicted.oob)^2))
> [1] 322.53346594

Prediction error


Family                                   Prediction Error
Regression                               mean-squared error
Quantile Regression                      mean-squared error
Classification                           misclassification, Brier score
Imbalanced Two-Class                     G-mean, misclassification, Brier score
Survival                                 Harrell’s C-index (1 minus concordance)
Competing Risk                           Harrell’s C-index (1 minus concordance)
Multivariate Regression                  per response: same as Regression
Multivariate Classification              per response: same as Classification
Multivariate Mixed Regression            per response: same as Regression, Classification
Multivariate Quantile Regression         same as Multivariate Regression
Unsupervised                             none
sidClustering                            same as Multivariate Mixed Regression
Breiman (Shi-Horvath)                    same as Classification

Quick Start

Output (Values/Verbose)
library(randomForestSRC)
# New York air quality measurements. Mean ozone in parts per billion.
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
print(airq.obj)


>  1                           Sample size: 111
>  2                       Number of trees: 500
>  3             Forest terminal node size: 5
>  4         Average no. of terminal nodes: 14.68
>  5  No. of variables tried at each split: 2
>  6                Total no. of variables: 5
>  7         Resampling used to grow trees: swor
>  8      Resample size used to grow trees: 70
>  9                              Analysis: RF-R
>  10                               Family: regr
>  11                       Splitting rule: mse *random*
>  12        Number of random split points: 10
>  13                      (OOB) R squared: 0.70871819
>  14    (OOB) Requested performance error: 322.53346594




Corresponding input (arguments)
rfsrc(…, ntree = 500)

Quick Start

Output (Values/Verbose)
library(randomForestSRC)
# New York air quality measurements. Mean ozone in parts per billion.
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
print(airq.obj)


>  1                           Sample size: 111
>  2                       Number of trees: 500
>  3             Forest terminal node size: 5
>  4         Average no. of terminal nodes: 14.68
>  5  No. of variables tried at each split: 2
>  6                Total no. of variables: 5
>  7         Resampling used to grow trees: swor
>  8      Resample size used to grow trees: 70
>  9                              Analysis: RF-R
>  10                               Family: regr
>  11                       Splitting rule: mse *random*
>  12        Number of random split points: 10
>  13                      (OOB) R squared: 0.70871819
>  14    (OOB) Requested performance error: 322.53346594





Corresponding input (arguments)
rfsrc(…, nodesize = 5)

Line 4 displays the average number of terminal nodes per tree

Quick Start

Output (Values/Verbose)
library(randomForestSRC)
# New York air quality measurements. Mean ozone in parts per billion.
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
print(airq.obj)


>  1                           Sample size: 111
>  2                       Number of trees: 500
>  3             Forest terminal node size: 5
>  4         Average no. of terminal nodes: 14.68
>  5  No. of variables tried at each split: 2
>  6                Total no. of variables: 5
>  7         Resampling used to grow trees: swor
>  8      Resample size used to grow trees: 70
>  9                              Analysis: RF-R
>  10                               Family: regr
>  11                       Splitting rule: mse *random*
>  12        Number of random split points: 10
>  13                      (OOB) R squared: 0.70871819
>  14    (OOB) Requested performance error: 322.53346594





Corresponding input (arguments)
rfsrc(…, mtry = NULL)

Lines 5 and 6 relate to mtry, the number of candidate variables tried at each node split

The default is the number of variables divided by 3 for regression; for all other families (including unsupervised settings), it is the square root of the number of variables. Values are rounded up
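The default rule above can be computed directly in base R ("rounded up" taken to mean `ceiling`):

```r
## Default mtry for the airquality fit (base R; p = 5 predictors)
p <- 5
ceiling(p / 3)      # regression default: matches "No. of variables tried" = 2
ceiling(sqrt(p))    # default for all other families
```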

Quick Start

Output (Values/Verbose)
library(randomForestSRC)
# New York air quality measurements. Mean ozone in parts per billion.
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
print(airq.obj)


>  1                           Sample size: 111
>  2                       Number of trees: 500
>  3             Forest terminal node size: 5
>  4         Average no. of terminal nodes: 14.68
>  5  No. of variables tried at each split: 2
>  6                Total no. of variables: 5
>  7         Resampling used to grow trees: swor
>  8      Resample size used to grow trees: 70
>  9                              Analysis: RF-R
>  10                               Family: regr
>  11                       Splitting rule: mse *random*
>  12        Number of random split points: 10
>  13                      (OOB) R squared: 0.70871819
>  14    (OOB) Requested performance error: 322.53346594






Corresponding input (arguments)
rfsrc(…, samptype = "swor")

where “swor” refers to sampling without replacement and “swr” refers to sampling with replacement

Line 8 displays the sample size used for the resampling scheme in line 7; for swor, this equals about 63.2% of the observations, matching the expected number of unique observations in a standard bootstrap sample
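The value 70 on line 8 is consistent with the usual 63.2% bootstrap heuristic, since a bootstrap sample of size n contains about n(1 - e^(-1)) unique observations on average (this derivation is the standard heuristic; the package's exact rule may differ slightly):

```r
## Expected number of unique observations in a bootstrap sample of size n
n <- 111
round(n * (1 - exp(-1)))   # about 0.632 * n
```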

Quick Start

Output (Values/Verbose)
library(randomForestSRC)
# New York air quality measurements. Mean ozone in parts per billion.
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
print(airq.obj)


>  1                           Sample size: 111
>  2                       Number of trees: 500
>  3             Forest terminal node size: 5
>  4         Average no. of terminal nodes: 14.68
>  5  No. of variables tried at each split: 2
>  6                Total no. of variables: 5
>  7         Resampling used to grow trees: swor
>  8      Resample size used to grow trees: 70
>  9                              Analysis: RF-R
>  10                               Family: regr
>  11                       Splitting rule: mse *random*
>  12        Number of random split points: 10
>  13                      (OOB) R squared: 0.70871819
>  14    (OOB) Requested performance error: 322.53346594








Explanation

Lines 9 and 10 show the type of forest, where RF-R and regr refer to regression

Quick Start

Output (Values/Verbose)
library(randomForestSRC)
# New York air quality measurements. Mean ozone in parts per billion.
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
print(airq.obj)


>  1                           Sample size: 111
>  2                       Number of trees: 500
>  3             Forest terminal node size: 5
>  4         Average no. of terminal nodes: 14.68
>  5  No. of variables tried at each split: 2
>  6                Total no. of variables: 5
>  7         Resampling used to grow trees: swor
>  8      Resample size used to grow trees: 70
>  9                              Analysis: RF-R
>  10                               Family: regr
>  11                       Splitting rule: mse *random*
>  12        Number of random split points: 10
>  13                      (OOB) R squared: 0.70871819
>  14    (OOB) Requested performance error: 322.53346594









Corresponding input (arguments)
rfsrc(…, splitrule = "mse")

Quick Start

Output (Values/Verbose)
library(randomForestSRC)
# New York air quality measurements. Mean ozone in parts per billion.
airq.obj <- rfsrc(Ozone ~ ., data = airquality)
print(airq.obj)


>  1                           Sample size: 111
>  2                       Number of trees: 500
>  3             Forest terminal node size: 5
>  4         Average no. of terminal nodes: 14.68
>  5  No. of variables tried at each split: 2
>  6                Total no. of variables: 5
>  7         Resampling used to grow trees: swor
>  8      Resample size used to grow trees: 70
>  9                              Analysis: RF-R
>  10                               Family: regr
>  11                       Splitting rule: mse *random*
>  12        Number of random split points: 10
>  13                      (OOB) R squared: 0.70871819
>  14    (OOB) Requested performance error: 322.53346594










Corresponding input (arguments)
rfsrc(…, nsplit = 10)

Line 12 shows nsplit, the number of random split points considered for each candidate splitting variable

Quick Start

flowchart LR
  A[Grow a forest] --> B[Check the output] --> C[<h3 style="color:darkgreen">Make the prediction</h3>]

Predicted outcome

library(randomForestSRC)
# New York air quality measurements. Mean ozone in parts per billion.
airq.obj <- rfsrc(Ozone ~ ., data = airquality[-c(1:10),])
head(airq.obj$predicted.oob)
>  21.97727  25.38251  21.26051  12.87687  24.10623  

Prediction on new data

airq.pred <- predict(airq.obj, newdata = airquality[c(1:10),])
head(airq.pred$predicted)
>  35.43525 23.75033 25.86907 24.55133 29.10551 

Data

Data structures allowed

Data must contain only real-valued variables or factors and be in data.frame format; use the as.data.frame() command to convert a matrix, tibble, etc.
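For instance, a numeric matrix can be converted before it is passed to rfsrc() (the variable names here are illustrative):

```r
## Convert a matrix to the data.frame format that rfsrc() expects
m <- matrix(rnorm(20), ncol = 2, dimnames = list(NULL, c("y", "x")))
dta <- as.data.frame(m)
is.data.frame(dta)   # TRUE
```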

Clean up your data

Choose your variables in formula and grow a tree

o <- rfsrc(y ~ a + z, data= dta, ntree = 1)

Your outcome(s) are saved in o$yvar and your predictors in o$xvar, taken from dta with missing values removed

Impute your data

To impute your data, use the following commands

o <- impute(y ~ ., data = dta, nimpute = 3) # for supervised problem
o <- impute(data = dta) # for unsupervised problem

Tip: factors or characters

Unlike lm(), glm(), or machine learning packages like ranger, rfsrc() does not require contrast/dummy coding
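To see what rfsrc() lets you skip, compare with the contrast columns that the lm()-style formula machinery builds from a factor (toy data; `d` and `grp` are invented names):

```r
## The dummy/contrast coding that lm() would construct from a factor
d <- data.frame(y = rnorm(4), grp = factor(c("a", "b", "c", "a")))
model.matrix(y ~ grp, data = d)   # intercept + grpb + grpc columns
## rfsrc(y ~ grp, data = d) would instead use the factor column directly
```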

General call to rfsrc

General call to rfsrc.cart

rfsrc.cart(formula, data, ntree = 1, mtry = ncol(data), bootstrap = "none", ...)

Tip: convenient interface for growing a Classification And Regression Tree (CART)

rfsrc.cart() is useful for growing single trees of almost any type

Visualizing a CART tree

Tip:

get.tree() is useful for visualizing a single tree, specified by tree.id, from an rfsrc.cart or rfsrc object

## visualizing a tree from a rfsrc.cart object
airq.tree <- rfsrc.cart(Ozone ~ ., data = airquality)
plot(get.tree(airq.tree, tree.id = 1))

Visualizing a CART tree

Tip:

get.tree() is useful for visualizing a single tree, specified by tree.id, from an rfsrc.cart or rfsrc object

## visualizing a tree from a rfsrc object
plot(get.tree(airq.obj, tree.id = 3))

Nonparametric regression


Family                                   Example Grow Call with Formula Specification
Regression                               rfsrc(Ozone~., data = airquality)
Quantile Regression                      quantreg(mpg~., data = mtcars)
Classification                           rfsrc(Species~., data = iris)
Imbalanced Two-Class                     imbalanced(status~., data = breast)
Survival                                 rfsrc(Surv(time, status)~., data = veteran)
Competing Risk                           rfsrc(Surv(time, status)~., data = wihs)
Multivariate Regression                  rfsrc(Multivar(mpg, cyl)~., data = mtcars)
Multivariate Mixed Regression            rfsrc(cbind(Species, Sepal.Length)~., data = iris)
Multivariate Quantile Regression         quantreg(cbind(mpg, cyl)~., data = mtcars)
Multivariate Mixed Quantile Regression   quantreg(cbind(Species, Sepal.Length)~., data = iris)
Unsupervised                             rfsrc(data = mtcars)
sidClustering                            sidClustering(data = mtcars)
Breiman (Shi-Horvath)                    sidClustering(data = mtcars, method = "sh")

Nonparametric regression

The goal is to estimate \(F({\bf{x}})=E(Y|{\bf{X}}={\bf{x}})\) by minimizing the empirical risk functional \[ R_n(F) = \frac{1}{n}\sum_{i=1}^n (Y_i-F({\bf{x}}_i))^2 \] The tree accomplishes this piece-by-piece. For a node \(t\), the tree replaces the node estimator \({\overline{Y}}_t\) by daughter node estimators \({\overline{Y}}_{t_L}\) and \({\overline{Y}}_{t_R}\)

Nonparametric regression

The empirical risk reduction from this split of \(t\) into its daughters \(t_L\) and \(t_R\) is \[ \frac{1}{n}\sum_{i\in t} (Y_i-{\overline{Y}}_t)^2 -\Bigg[\frac{1}{n}\sum_{i\in t_L} (Y_i-{\overline{Y}}_{t_L})^2 + \frac{1}{n}\sum_{i\in t_R} (Y_i-{\overline{Y}}_{t_R})^2\Bigg] \] Maximizing this decrease is equivalent to minimizing \[ \frac{1}{n}\sum_{i\in t_L} (Y_i-{\overline{Y}}_{t_L})^2 + \frac{1}{n}\sum_{i\in t_R} (Y_i-{\overline{Y}}_{t_R})^2 \] This leads to the weighted variance (mse) splitting used by CART and RF
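The equivalence above follows because the parent term is constant in the split. A quick numeric check in base R (toy data; `sse`, `daughter`, and `decrease` are names invented for this sketch):

```r
## The parent sum of squares does not depend on the split s, so the split
## with the larger risk decrease is exactly the one with the smaller
## daughter sum (base R sketch)
set.seed(5)
y <- rnorm(30); x <- runif(30)
sse      <- function(v) sum((v - mean(v))^2)
daughter <- function(s) (sse(y[x <= s]) + sse(y[x > s])) / 30
decrease <- function(s) sse(y) / 30 - daughter(s)
(decrease(0.3) > decrease(0.6)) == (daughter(0.3) < daughter(0.6))
```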

Split rules for regression


Family                                   splitrule
Regression                               mse
Quantile Regression                      la.quantile.regr, quantile.regr, mse
Classification                           gini, auc, entropy
Imbalanced Two-Class                     gini, auc, entropy
Survival                                 logrank, bs.gradient, logrankscore
Competing Risk                           logrankCR, logrank
Multivariate Regression                  mv.mse, mahalanobis
Multivariate Classification              mv.gini
Multivariate Mixed Regression            mv.mix
Multivariate Quantile Regression         mv.mse
Multivariate Mixed Quantile Regression   mv.mix
Unsupervised                             unsupv
sidClustering                            {mv.mse, mv.gini, mv.mix}, mahalanobis
Breiman (Shi-Horvath)                    gini, auc, entropy

Mean-squared error (mse) splitting rule

Let \(Y_1,\ldots,Y_n\) be the outcomes in \(t\). The mse split-statistic for a proposed split \(s\) is \[D(s,t)=\frac{1}{n}\sum_{i\in t_L}(Y_i-\overline{Y}_{t_L})^2 + \frac{1}{n}\sum_{i\in t_R}(Y_i-\overline{Y}_{t_R})^2\] where

            \(t_L=\{X_i\le s\}\)    and    \(t_R=\{X_i>s\}\)


are the left and right daughter nodes and

             \(\overline{Y}_{t_L}=\frac{1}{n}\sum_{i\in t_L}Y_i\)    and    \(\overline{Y}_{t_R}=\frac{1}{n}\sum_{i\in t_R}Y_i\)

are the sample means for \(t_L\) and \(t_R\) respectively. The best split for \(X\) is the split-point \(s\) minimizing \(D(s,t)\) [4]
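The statistic \(D(s,t)\) and an exhaustive search for the best split-point can be sketched in base R (illustrative only; as shown earlier in the printed output, the package's default uses randomized rather than exhaustive splitting):

```r
## mse split-statistic D(s, t) and exhaustive search over observed split points
D <- function(s, x, y) {
  L <- y[x <= s]; R <- y[x > s]
  (sum((L - mean(L))^2) + sum((R - mean(R))^2)) / length(y)
}
set.seed(3)
x <- runif(100)
y <- ifelse(x > 0.4, 3, 0) + rnorm(100, sd = 0.2)   # jump at x = 0.4
cand <- head(sort(unique(x)), -1)                   # every observed x but the max
best <- cand[which.min(sapply(cand, D, x = x, y = y))]
best                                                # close to the change point 0.4
```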


Key quantities

The key quantities returned by the package are

o$predicted       --->   inbag estimated conditional mean
o$predicted.oob   --->   OOB estimated conditional mean

Regression example: Iowa housing

Ames Iowa Housing Data

Data from the Ames Assessor’s Office used in assessing values of individual residential properties sold in Ames, Iowa from 2006 to 2010. This is a regression problem, and the goal is to predict “SalePrice”, which records the price of a home in thousands of dollars [5]

[Table: first six rows of the housing data (81 variables, from PID, MS.SubClass, and MS.Zoning through Sale.Type, Sale.Condition, and SalePrice); the displayed SalePrice values are 215000, 105000, 172000, 244000, 189900, and 195500]
data(housing, package = "randomForestSRC")  
dim(housing)  
> [1] 2930   81

Regression example: Iowa housing

The key quantities returned by the package are

o$predicted       --->   inbag estimated conditional mean
o$predicted.oob   --->   OOB estimated conditional mean
o <- rfsrc(SalePrice~., data = housing, nodesize = 1)
o$predicted[1:6] #  --->   inbag estimated conditional mean
> [1] 12.17378 11.63555 11.99636 12.40886 12.14098 12.16425

o$predicted.oob[1:6] #  --->   OOB estimated conditional mean
> [1] 12.08462 11.70282 11.95179 12.39756 12.13130 12.15017

Regression example: Iowa housing

Tip: omit or impute missing data

The default setting o <- rfsrc(…, na.action = "na.omit") will provide complete cases in o$yvar and o$xvar
impute() is useful for imputing missing data

PID        MS.SubClass  MS.Zoning  Lot.Frontage  Lot.Area  Street  Alley          Lot.Shape  Land.Contour  Lot.Frontage.1  Garage.Yr.Blt
527402200  20           RL         NA            11241     Pave    noAlleyAccess  IR1        Lvl           NA              1970
527402250  20           RL         NA            12537     Pave    noAlleyAccess  IR1        Lvl           NA              1971
528240070  60           RL         NA            7851      Pave    noAlleyAccess  Reg        Lvl           NA              2002
527425090  20           RL         70            10500     Pave    noAlleyAccess  Reg        Lvl           70              NA
534276360  20           RL         77            9320      Pave    noAlleyAccess  IR1        Lvl           77              NA
909131125  190          RH         NA            7082      Pave    noAlleyAccess  Reg        Lvl           NA              NA
534427010  90           RL         98            13260     Pave    noAlleyAccess  IR1        Lvl           98              NA
534450180  20           RL         50            7207      Pave    noAlleyAccess  IR1        Lvl           50              NA
904351040  70           C (all)    NA            6449      Pave    noAlleyAccess  IR1        Lvl           NA              NA
903200050  30           RL         NA            7446      Pave    noAlleyAccess  Reg        Lvl           NA              NA

Regression example: Iowa housing

Tip: omit or impute missing data

The default setting o <- rfsrc(…, na.action = "na.omit") will provide complete cases in o$yvar and o$xvar
impute() is used for imputing missing data

PID        MS.SubClass  MS.Zoning  Lot.Frontage  Lot.Area  Street  Alley          Lot.Shape  Land.Contour  Lot.Frontage.1  Garage.Yr.Blt
527402200  20           RL         NA            11241     Pave    noAlleyAccess  IR1        Lvl           NA              1970
527402250  20           RL         NA            12537     Pave    noAlleyAccess  IR1        Lvl           NA              1971
528240070  60           RL         NA            7851      Pave    noAlleyAccess  Reg        Lvl           NA              2002
527425090  20           RL         70            10500     Pave    noAlleyAccess  Reg        Lvl           70              NA
534276360  20           RL         77            9320      Pave    noAlleyAccess  IR1        Lvl           77              NA
909131125  190          RH         NA            7082      Pave    noAlleyAccess  Reg        Lvl           NA              NA
534427010  90           RL         98            13260     Pave    noAlleyAccess  IR1        Lvl           98              NA
534450180  20           RL         50            7207      Pave    noAlleyAccess  IR1        Lvl           50              NA
904351040  70           C (all)    NA            6449      Pave    noAlleyAccess  IR1        Lvl           NA              NA
903200050  30           RL         NA            7446      Pave    noAlleyAccess  Reg        Lvl           NA              NA
nrow(housing)
>  [1] 2930
o <- rfsrc(SalePrice~., data= housing, ntree = 1)
length(o$yvar)
>  [1] 2274
nrow(o$xvar)
>  [1] 2274

Regression example: Iowa housing

Tip: omit or impute missing data

The default setting o <- rfsrc(…, na.action = "na.omit") will provide complete cases in o$yvar and o$xvar
impute() is used for imputing missing data

PID        MS.SubClass  MS.Zoning  Lot.Frontage  Lot.Area  Street  Alley          Lot.Shape  Land.Contour  Lot.Frontage.1  Garage.Yr.Blt
527402200  20           RL         NA            11241     Pave    noAlleyAccess  IR1        Lvl           NA              1970
527402250  20           RL         NA            12537     Pave    noAlleyAccess  IR1        Lvl           NA              1971
528240070  60           RL         NA            7851      Pave    noAlleyAccess  Reg        Lvl           NA              2002
527425090  20           RL         70            10500     Pave    noAlleyAccess  Reg        Lvl           70              NA
534276360  20           RL         77            9320      Pave    noAlleyAccess  IR1        Lvl           77              NA
909131125  190          RH         NA            7082      Pave    noAlleyAccess  Reg        Lvl           NA              NA
534427010  90           RL         98            13260     Pave    noAlleyAccess  IR1        Lvl           98              NA
534450180  20           RL         50            7207      Pave    noAlleyAccess  IR1        Lvl           50              NA
904351040  70           C (all)    NA            6449      Pave    noAlleyAccess  IR1        Lvl           NA              NA
903200050  30           RL         NA            7446      Pave    noAlleyAccess  Reg        Lvl           NA              NA
nrow(housing)
>  [1] 2930
# Use supervised OTF imputation
housing.im <- impute(SalePrice~., data = housing)
nrow(housing.im)
>  [1] 2930
housing[c(12,15,23,24,25,56,28,120,2846,126,130,214,2671),c(1:9,4,60)]
>             PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street         Alley Lot.Shape Land.Contour Lot.Frontage.1 Garage.Yr.Blt
>  12   527165230          20        RL           NA     7980   Pave noAlleyAccess       IR1          Lvl             NA          1992
>  15   527182190         120        RL           NA     6820   Pave noAlleyAccess       IR1          Lvl             NA          1985
>  23   527368020          60        FV           NA     7500   Pave noAlleyAccess       Reg          Lvl             NA          2000
>  24   527402200          20        RL           NA    11241   Pave noAlleyAccess       IR1          Lvl             NA          1970
>  25   527402250          20        RL           NA    12537   Pave noAlleyAccess       IR1          Lvl             NA          1971
>  56   528240070          60        RL           NA     7851   Pave noAlleyAccess       Reg          Lvl             NA          2002
>  28   527425090          20        RL           70    10500   Pave noAlleyAccess       Reg          Lvl             70            NA
>  120  534276360          20        RL           77     9320   Pave noAlleyAccess       IR1          Lvl             77            NA
>  2846 909131125         190        RH           NA     7082   Pave noAlleyAccess       Reg          Lvl             NA            NA
>  126  534427010          90        RL           98    13260   Pave noAlleyAccess       IR1          Lvl             98            NA
>  130  534450180          20        RL           50     7207   Pave noAlleyAccess       IR1          Lvl             50            NA
>  214  904351040          70   C (all)           NA     6449   Pave noAlleyAccess       IR1          Lvl             NA            NA
>  2671 903200050          30        RL           NA     7446   Pave noAlleyAccess       Reg          Lvl             NA            NA
nrow(housing)
>  [1] 2930
# do quick and dirty imputation
housing.im <- impute(SalePrice~., data = housing)
nrow(housing.im)
>  [1] 2930
housing.im[c(12,15,23,24,25,56,28,120,2846,126,130,214,2671),c(1:9,4,60)]
>             PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street         Alley Lot.Shape Land.Contour Lot.Frontage.1 Garage.Yr.Blt
>  12   527165230          20        RL     68.85565     7980   Pave noAlleyAccess       IR1          Lvl       68.85565      1992.000
>  15   527182190         120        RL     68.56751     6820   Pave noAlleyAccess       IR1          Lvl       68.56751      1985.000
>  23   527368020          60        FV     66.21796     7500   Pave noAlleyAccess       Reg          Lvl       66.21796      2000.000
>  24   527402200          20        RL     74.53002    11241   Pave noAlleyAccess       IR1          Lvl       74.53002      1970.000
>  25   527402250          20        RL     80.12155    12537   Pave noAlleyAccess       IR1          Lvl       80.12155      1971.000
>  56   528240070          60        RL     71.88750     7851   Pave noAlleyAccess       Reg          Lvl       71.88750      2002.000
>  28   527425090          20        RL     70.00000    10500   Pave noAlleyAccess       Reg          Lvl       70.00000      1965.035
>  120  534276360          20        RL     77.00000     9320   Pave noAlleyAccess       IR1          Lvl       77.00000      1961.286
>  2846 909131125         190        RH     63.99655     7082   Pave noAlleyAccess       Reg          Lvl       63.99655      1946.405
>  126  534427010          90        RL     98.00000    13260   Pave noAlleyAccess       IR1          Lvl       98.00000      1968.485
>  130  534450180          20        RL     50.00000     7207   Pave noAlleyAccess       IR1          Lvl       50.00000      1962.214
>  214  904351040          70   C (all)     64.52222     6449   Pave noAlleyAccess       IR1          Lvl       64.52222      1942.500
>  2671 903200050          30        RL     66.49275     7446   Pave noAlleyAccess       Reg          Lvl       66.49275      1952.198

Regression example: Iowa housing

Tip: model with missing data

rfsrc(…, na.action = "na.impute") implements on-the-fly imputation

PID        MS.SubClass  MS.Zoning  Lot.Frontage  Lot.Area  Street  Alley          Lot.Shape  Land.Contour  Lot.Frontage.1  Garage.Yr.Blt
527402200  20           RL         NA            11241     Pave    noAlleyAccess  IR1        Lvl           NA              1970
527402250  20           RL         NA            12537     Pave    noAlleyAccess  IR1        Lvl           NA              1971
528240070  60           RL         NA            7851      Pave    noAlleyAccess  Reg        Lvl           NA              2002
527425090  20           RL         70            10500     Pave    noAlleyAccess  Reg        Lvl           70              NA
534276360  20           RL         77            9320      Pave    noAlleyAccess  IR1        Lvl           77              NA
909131125  190          RH         NA            7082      Pave    noAlleyAccess  Reg        Lvl           NA              NA
534427010  90           RL         98            13260     Pave    noAlleyAccess  IR1        Lvl           98              NA
534450180  20           RL         50            7207      Pave    noAlleyAccess  IR1        Lvl           50              NA
904351040  70           C (all)    NA            6449      Pave    noAlleyAccess  IR1        Lvl           NA              NA
903200050  30           RL         NA            7446      Pave    noAlleyAccess  Reg        Lvl           NA              NA
o <- rfsrc(SalePrice~., data = housing)
o
                         Sample size: 2274
                     Number of trees: 500
           Forest terminal node size: 5
       Average no. of terminal nodes: 310.83
No. of variables tried at each split: 27
              Total no. of variables: 80
       Resampling used to grow trees: swor
    Resample size used to grow trees: 1437
                            Analysis: RF-R
                              Family: regr
                      Splitting rule: mse *random*
       Number of random split points: 10
                     (OOB) R squared: 0.90907743
   (OOB) Requested performance error: 631482361.907825
o <- rfsrc(SalePrice~., data = housing, na.action = "na.impute")
o
                         Sample size: 2930
                    Was data imputed: yes
                     Number of trees: 500
           Forest terminal node size: 5
       Average no. of terminal nodes: 402.424
No. of variables tried at each split: 27
              Total no. of variables: 80
       Resampling used to grow trees: swor
    Resample size used to grow trees: 1852
                            Analysis: RF-R
                              Family: regr
                      Splitting rule: mse *random*
       Number of random split points: 10
                     (OOB) R squared: 0.90785994
   (OOB) Requested performance error: 588027121.705363

Regression example: Iowa housing

Transform the outcome

Tip

A monotonic transformation of the outcome won’t substantially change the structure of the random forest, owing to its robustness. However, it may be useful for scaling the mean squared error

o <- rfsrc(SalePrice~., data= housing)
o
>                          Sample size: 2274
>                      Number of trees: 500
>            Forest terminal node size: 5
>        Average no. of terminal nodes: 310.594
> No. of variables tried at each split: 27
>               Total no. of variables: 80
>        Resampling used to grow trees: swor
>     Resample size used to grow trees: 1437
>                             Analysis: RF-R
>                               Family: regr
>                       Splitting rule: mse *random*
>        Number of random split points: 10
>                      (OOB) R squared: 0.90969641
>    (OOB) Requested performance error: 627183369.648676
> housing$SalePrice <- log(housing$SalePrice)
> o <- rfsrc(SalePrice~., data= housing)
> o
                         Sample size: 2274
                     Number of trees: 500
           Forest terminal node size: 5
       Average no. of terminal nodes: 310.61
No. of variables tried at each split: 27
              Total no. of variables: 80
       Resampling used to grow trees: swor
    Resample size used to grow trees: 1437
                            Analysis: RF-R
                              Family: regr
                      Splitting rule: mse *random*
       Number of random split points: 10
                     (OOB) R squared: 0.89837879
   (OOB) Requested performance error: 0.01688687

Regression example: Iowa housing

Tune nodesize

rfsrc(formula, data, ntree = 500, mtry = NULL, ytry = NULL, nodesize = NULL, …)

nodesize sets the minimum terminal node size; the default values are:


  15 for survival
  15 for competing risk
  5 for regression
  1 for classification
  3 for mixed outcomes
  3 for unsupervised


We can use tune.nodesize() to tune this parameter

o <- tune.nodesize(SalePrice~., housing)
o$nsize.opt
> [1] 1
> nodesize =  1    error = 10.75% 
> nodesize =  2    error = 11.36% 
> nodesize =  3    error = 11.55% 
> nodesize =  4    error = 11.3% 
> nodesize =  5    error = 11.72% 
> nodesize =  6    error = 12.2% 
> nodesize =  7    error = 12.75% 
> nodesize =  8    error = 12.54% 
> nodesize =  9    error = 13.2% 
> nodesize =  10    error = 13.98% 
> nodesize =  15    error = 14.88% 
> nodesize =  20    error = 17.14% 
> nodesize =  25    error = 19.26% 
> nodesize =  30    error = 20.92% 
> nodesize =  35    error = 21.64% 
> nodesize =  40    error = 23.86% 
> nodesize =  45    error = 24.64% 
> nodesize =  50    error = 25.4% 
> nodesize =  55    error = 26.87% 
> optimal nodesize: 1 

Quantile regression


Family                                   Example Grow Call with Formula Specification
Regression                               rfsrc(Ozone~., data = airquality)
Quantile Regression                      quantreg(mpg~., data = mtcars)
Classification                           rfsrc(Species~., data = iris)
Imbalanced Two-Class                     imbalanced(status~., data = breast)
Survival                                 rfsrc(Surv(time, status)~., data = veteran)
Competing Risk                           rfsrc(Surv(time, status)~., data = wihs)
Multivariate Regression                  rfsrc(Multivar(mpg, cyl)~., data = mtcars)
Multivariate Mixed Regression            rfsrc(cbind(Species, Sepal.Length)~., data = iris)
Multivariate Quantile Regression         quantreg(cbind(mpg, cyl)~., data = mtcars)
Multivariate Mixed Quantile Regression   quantreg(cbind(Species, Sepal.Length)~., data = iris)
Unsupervised                             rfsrc(data = mtcars)
sidClustering                            sidClustering(data = mtcars)
Breiman (Shi-Horvath)                    sidClustering(data = mtcars, method = "sh")

Quantile regression

Quantile regression estimates the conditional distribution function (CDF) \[ F(y|\mathbf{ X}=\mathbf{ x})=P\{Y\le y|\mathbf{ X}=\mathbf{ x}\} \]

The empirical risk functional for a quantile \(0<\alpha<1\) is \[ R_n(F)=\frac{1}{n}\sum_{i=1}^n L_{\alpha}(Y_i, F(\mathbf{ x}_i)) \] where \(L_{\alpha}(Y, u)=\phi_{\alpha}(Y-u)\) is the loss function (pinball loss) and \[ \phi_{\alpha}(u)=u(\alpha-1_{\{u<0\}}) \] is the check-loss function
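To make the check loss concrete, here is a small base-R sketch (an illustration, not package code) verifying numerically that the minimizer of the empirical pinball risk is the \(\alpha\)-quantile:

```r
## check-loss (pinball) function: phi_alpha(u) = u * (alpha - 1{u < 0})
phi <- function(u, alpha) u * (alpha - (u < 0))

## empirical risk of a candidate value u for data y
risk <- function(u, alpha, y) mean(phi(y - u, alpha))

## the risk minimizer over a grid should sit near the alpha-quantile of y
set.seed(1)
y <- rnorm(1000)
u.grid <- seq(-3, 3, length.out = 601)
r <- sapply(u.grid, risk, alpha = 0.75, y = y)
u.grid[which.min(r)]      ## close to quantile(y, 0.75)
quantile(y, 0.75)
```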

Quantile regression

In regression, the tree node estimator is the sample average \[ E(Y|\mathbf{ X}\in t) = \overline{Y}_{t} \] In quantile regression it is the CDF \[ P\{Y\le y|\mathbf{ X}\in t\} = \frac{1}{n_t}\sum_{i\in t} 1_{\{Y_i\le y\}} \] where \(n_t\) is the number of cases in node \(t\)
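Both node estimators are simple sample summaries over the cases falling in the node; a minimal base-R illustration:

```r
## responses for cases falling in a hypothetical node t
y.node <- c(2.1, 3.5, 3.5, 5.0, 7.2)

## regression: node estimator is the sample average
mean(y.node)

## quantile regression: node estimator is the empirical CDF
ecdf(y.node)(4)   ## P{Y <= 4 | X in t} = 3/5
```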

Split rules for regression


Family                                   splitrule
Regression                               mse
Quantile Regression                      la.quantile.regr, quantile.regr, mse
Classification                           gini, auc, entropy
Imbalanced Two-Class                     gini, auc, entropy
Survival                                 logrank, bs.gradient, logrankscore
Competing Risk                           logrankCR, logrank
Multivariate Regression                  mv.mse, mahalanobis
Multivariate Classification              mv.gini
Multivariate Mixed Regression            mv.mix
Multivariate Quantile Regression         mv.mse
Multivariate Mixed Quantile Regression   mv.mix
Unsupervised                             unsupv
sidClustering                            {mv.mse, mv.gini, mv.mix}, mahalanobis
Breiman (Shi-Horvath)                    gini, auc, entropy

Quantile regression splitting rule

  • quantile.regr is the quantile splitting rule using quantile (pinball) loss. It requires specifying the target quantiles, which by default are the percentiles 1%, 2%, \(\ldots\), 99%

  • la.quantile.regr refers to local adaptive quantile regression splitting (the default action for quantreg) under default percentiles 1%, 2%, \(\ldots\), 99%
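When only a few quantiles are of interest, the default percentile grid can be replaced by a shorter vector of target quantiles. A hedged sketch, assuming the prob argument of quantreg:

```r
library(randomForestSRC)

## request only the quartiles rather than the default
## 1%, 2%, ..., 99% grid (prob is assumed available in quantreg)
o <- quantreg(SalePrice ~ ., housing, prob = c(.25, .50, .75))
```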

Key quantities

The key quantities returned by the package are

o$quantreg  ---> quantiles for the CDF (extracted by helper function)

General call to quantreg

Regression example: Iowa housing

Quantile regression

o <- quantreg(SalePrice ~ ., housing, splitrule = "mse", ntree = 250)
o <- quantreg(SalePrice ~ ., housing, splitrule = "quantile.regr", ntree = 250)
o <- quantreg(SalePrice ~ ., housing, splitrule = "la.quantile.regr", ntree = 250) # (default)
> o
                         Sample size: 2274
                     Number of trees: 250
           Forest terminal node size: 5
       Average no. of terminal nodes: 263.86
No. of variables tried at each split: 27
              Total no. of variables: 80
       Resampling used to grow trees: swor
    Resample size used to grow trees: 1437
                            Analysis: RF-R
                              Family: regr
                      Splitting rule: la.quantile.regr *random*
       Number of random split points: 10
                     (OOB) R squared: 0.85752024
   (OOB) Requested performance error: 0.02367652
plot.quantreg(o)

Regression example: Iowa housing

The key quantities returned by the package are

o$quantreg  ---> quantiles for the CDF (extracted by helper function)
o <- quantreg(SalePrice ~ ., housing)
names(o$quantreg)
> [1] "quantiles" "prob"      "cdf"       "density"   "yunq"     

head(o$quantreg$yunq)
> [1]  9.456341  9.480368 10.463103 10.471950 10.596635 10.714418
head(o$quantreg$prob)
> [1] 0.01 0.02 0.03 0.04 0.05 0.06
head(o$quantreg$cdf[ ,1:5])
>      [,1] [,2]         [,3]         [,4]         [,5]
> [1,]    0    0 0.0008795075 0.0008795075 0.0008795075
> [2,]    0    0 0.0008795075 0.0008795075 0.0008795075
> [3,]    0    0 0.0008795075 0.0008795075 0.0008795075
> [4,]    0    0 0.0008795075 0.0008795075 0.0008795075
> [5,]    0    0 0.0008795075 0.0008795075 0.0008795075
> [6,]    0    0 0.0008795075 0.0008795075 0.0008795075
head(o$quantreg$density[ ,1:5])
>      [,1] [,2]         [,3] [,4] [,5]
> [1,]    0    0 0.0008795075    0    0
> [2,]    0    0 0.0008795075    0    0
> [3,]    0    0 0.0008795075    0    0
> [4,]    0    0 0.0008795075    0    0
> [5,]    0    0 0.0008795075    0    0
> [6,]    0    0 0.0008795075    0    0
head(o$quantreg$quantiles[ ,1:5])
>          [,1]     [,2]     [,3]     [,4]     [,5]
> [1,] 11.53273 11.68658 11.74543 11.77913 11.80560
> [2,] 11.24505 11.40199 11.46163 11.49883 11.52288
> [3,] 11.42954 11.58058 11.64151 11.67844 11.70355
> [4,] 11.81303 11.96285 12.01974 12.05815 12.08108
> [5,] 11.64395 11.79434 11.85296 11.88793 11.91505
> [6,] 11.68856 11.83501 11.89478 11.93164 11.95761

Regression example: Iowa housing

## (A) extract 25,50,75 quantiles
  quant.dat <- get.quantile(o, c(.25, .50, .75))
  head(quant.dat)
>          q.25     q.50     q.75
> [1,] 12.01673 12.07254 12.13081
> [2,] 11.62536 11.67844 11.73767
> [3,] 11.89341 11.94795 12.00457
> [4,] 12.35017 12.40492 12.46071
> [5,] 12.04614 12.10071 12.15767
> [6,] 12.08954 12.14420 12.20106

## (B) values expected under normality
  quant.stat <- get.quantile.stat(o)
  c.mean <- quant.stat$mean
  c.std <- quant.stat$std
  q.25.est <- c.mean + qnorm(.25) * c.std
  q.75.est <- c.mean + qnorm(.75) * c.std

## compare (A) and (B)
  print(head(data.frame(quant.dat[, -2],  q.25.est, q.75.est)))
>       q.25     q.75 q.25.est q.75.est
> 1 12.01673 12.13081 11.98342 12.15846
> 2 11.62536 11.73767 11.59368 11.76268
> 3 11.89341 12.00457 11.85888 12.03187
> 4 12.35017 12.46071 12.31451 12.49045
> 5 12.04614 12.15767 12.01076 12.18538
> 6 12.08954 12.20106 12.05462 12.22996

Regression example: Iowa housing

The run.rfsrc function for an overview

Runs rfsrc on user-specified data and performs a detailed analysis, including tuning the forest and plotting various quantities such as performance metrics and variable importance. Applies to the regression, classification, survival, competing risk and multivariate families

library(devtools)
devtools::install_github("kogalur/randomForestSRC.run")

Regression example: Iowa housing

library(randomForestSRC.run)
run.rfsrc(SalePrice ~ ., housing, ntree = 500)

Classification


Family                                   Example Grow Call with Formula Specification
Regression                               rfsrc(Ozone~., data = airquality)
Quantile Regression                      quantreg(mpg~., data = mtcars)
Classification                           rfsrc(Species~., data = iris)
Imbalanced Two-Class                     imbalanced(status~., data = breast)
Survival                                 rfsrc(Surv(time, status)~., data = veteran)
Competing Risk                           rfsrc(Surv(time, status)~., data = wihs)
Multivariate Regression                  rfsrc(Multivar(mpg, cyl)~., data = mtcars)
Multivariate Mixed Regression            rfsrc(cbind(Species, Sepal.Length)~., data = iris)
Multivariate Quantile Regression         quantreg(cbind(mpg, cyl)~., data = mtcars)
Multivariate Mixed Quantile Regression   quantreg(cbind(Species, Sepal.Length)~., data = iris)
Unsupervised                             rfsrc(data = mtcars)
sidClustering                            sidClustering(data = mtcars)
Breiman (Shi-Horvath)                    sidClustering(data = mtcars, method = "sh")

Classification

The classification-regression goal is to estimate \[ F_j(\mathbf{x})=P\{Y=j|\mathbf{X}=\mathbf{x}\} \hskip10pt \text{for } j=1,\ldots,J \] by minimizing the empirical risk functional, \(F=(F_1,\ldots,F_J)\), \[ R_n(F)=\frac{1}{n}\sum_{i=1}^n L(Y_i, F(\mathbf{x}_i)) \] using loss functions like Gini impurity, entropy and the Brier score
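As a quick illustration of these loss functions (plain base R, not package code), the Gini impurity and entropy of a node with class proportions p are:

```r
## class proportions in a hypothetical node
p <- c(0.6, 0.3, 0.1)

## Gini impurity: sum_j p_j (1 - p_j)
sum(p * (1 - p))     ## 0.54

## entropy: -sum_j p_j log(p_j)
-sum(p * log(p))
```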

Classification

The classification goal is to classify cases, typically performed using the Bayes rule (maximal probability) \[ \delta(\mathbf{x}) = j, \hskip10pt \text{where } F_j(\mathbf{x})\ge F_{j'}(\mathbf{x}), \hskip10pt j'=1,\ldots,J \]

The tree node estimator equals the relative frequencies \[ P\{Y=j|\mathbf{X}\in t\} = \dfrac{\text{number of cases in $t$ that are class label $j$}}{\text{number of cases in $t$}} \]
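A minimal base-R sketch of the node estimator and the Bayes rule:

```r
## class labels of cases falling in a hypothetical node t
y.node <- factor(c("a", "a", "b", "c", "a"))

## node estimator: relative class frequencies
p.hat <- table(y.node) / length(y.node)
p.hat                               ## a: 0.6, b: 0.2, c: 0.2

## Bayes rule: predict the class with maximal probability
names(p.hat)[which.max(p.hat)]      ## "a"
```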

Key quantities

The key quantities returned by the package are

o$predicted     --->   inbag estimated probabilities
o$predicted.oob --->   OOB estimated probabilities

o$class         --->   inbag class predictions
o$class.oob     --->   OOB class predictions

Classification example: Glioma

Subset of the data used in Ceccarelli et al. (2016) for molecular profiling of adult diffuse gliomas. It contains 1206 probes from 880 tissues, along with clinical and other molecular data. The outcome has seven class labels: Classic-like, Codel, G-CIMP-high, G-CIMP-low, LGm6-GBM, Mesenchymal-like and PA-like [6]

               y cg23970089 cg25176823 cg24576735   Chr19.20co.gain TERT.promoter.status SNP6 U133a grade age gender
Mesenchymal-like 0.36518600  0.3244530 0.03255164 No chr 19/20 gain               Mutant  Yes   Yes    G4  44 female
Mesenchymal-like 0.60246131  0.3626941 0.02215452 No chr 19/20 gain               Mutant  Yes   Yes    G4  50   male
Mesenchymal-like 0.44011951  0.3679728 0.03369096 No chr 19/20 gain               Mutant  Yes    No    G4  56 female
    Classic-like 0.06834791  0.3974054 0.03623296 No chr 19/20 gain               Mutant  Yes   Yes    G4  40 female
    Classic-like 0.19843219  0.2736760 0.04060163 No chr 19/20 gain               Mutant  Yes   Yes    G4  61 female
# install.packages("devtools") # if you have not installed "devtools" package
devtools::install_github("kogalur/varPro")
data(glioma, package = "varPro") 
dim(glioma)
> [1]  880 1242

Classification example: Glioma

o <- rfsrc(y~., data = glioma)
o
                         Sample size: 878
           Frequency of class labels: 148, 174, 249, 25, 41, 215, 26
                     Number of trees: 500
           Forest terminal node size: 1
       Average no. of terminal nodes: 74.302
No. of variables tried at each split: 36
              Total no. of variables: 1241
       Resampling used to grow trees: swor
    Resample size used to grow trees: 555
                            Analysis: RF-C
                              Family: class
                      Splitting rule: gini *random*
       Number of random split points: 10
                   (OOB) Brier score: 0.02333115
        (OOB) Normalized Brier score: 0.19053769
                           (OOB) AUC: 0.99304797
                      (OOB) Log-loss: 0.34115074
   (OOB) Requested performance error: 0.07061503, 0.06756757, 0.01149425, 0.00803213, 0.52, 0.53658537, 0.04651163, 0.11538462

Confusion matrix:

                  predicted
  observed         Classic-like Codel G-CIMP-high G-CIMP-low LGm6-GBM Mesenchymal-like PA-like class.error
  Classic-like              138     0           0          0        0               10       0      0.0676
  Codel                       0   172           2          0        0                0       0      0.0115
  G-CIMP-high                 0     2         247          0        0                0       0      0.0080
  G-CIMP-low                  0     0          13         12        0                0       0      0.5200
  LGm6-GBM                    0     0           1          0       19               21       0      0.5366
  Mesenchymal-like           11     0           0          0        0              204       0      0.0512
  PA-like                     0     0           0          0        0                3      23      0.1154

      (OOB) Misclassification rate: 0.07175399

Tip

rfsrc automatically recognizes a classification problem when the outcome is a factor
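For example (a hypothetical toy data frame), a numerically coded outcome is fit as regression unless it is first converted to a factor:

```r
library(randomForestSRC)

## hypothetical 0/1 outcome coded as numeric
df <- data.frame(y = rbinom(50, 1, 0.5), x1 = rnorm(50), x2 = rnorm(50))
rfsrc(y ~ ., df)$family    ## "regr": treated as regression

## coerce to factor to obtain a classification forest
df$y <- factor(df$y)
rfsrc(y ~ ., df)$family    ## "class"
```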

Key quantities

The key quantities returned by the package are

o$predicted     --->   inbag estimated probabilities
o$predicted.oob --->   OOB estimated probabilities

o$class         --->   inbag class predictions
o$class.oob     --->   OOB class predictions
o$predicted[1:5,]     # --->   inbag estimated probabilities
>      Classic-like Codel G-CIMP-high G-CIMP-low LGm6-GBM Mesenchymal-like PA-like
> [1,]        0.006 0.002       0.000      0.004    0.084            0.892   0.012
> [2,]        0.136 0.008       0.000      0.004    0.014            0.832   0.006
> [3,]        0.064 0.004       0.008      0.010    0.018            0.892   0.004
> [4,]        0.928 0.008       0.004      0.004    0.002            0.054   0.000
> [5,]        0.936 0.000       0.000      0.002    0.000            0.062   0.000
o$predicted.oob[1:5,] # --->   OOB estimated probabilities
>      Classic-like       Codel G-CIMP-high  G-CIMP-low    LGm6-GBM Mesenchymal-like    PA-like
> [1,]   0.01840491 0.006134969  0.00000000 0.012269939 0.257668712        0.6687117 0.03680982
> [2,]   0.38857143 0.022857143  0.00000000 0.011428571 0.040000000        0.5200000 0.01714286
> [3,]   0.17391304 0.010869565  0.02173913 0.027173913 0.048913043        0.7065217 0.01086957
> [4,]   0.81443299 0.020618557  0.01030928 0.010309278 0.005154639        0.1391753 0.00000000
> [5,]   0.82222222 0.000000000  0.00000000 0.005555556 0.000000000        0.1722222 0.00000000

o$class[1:5]       #  --->   inbag class predictions
> [1] Mesenchymal-like Mesenchymal-like Mesenchymal-like Classic-like     Classic-like    

o$class.oob[1:5]   #  --->   OOB class predictions
> [1] Mesenchymal-like Mesenchymal-like Mesenchymal-like Classic-like     Classic-like  

Classification example: Glioma

Split rules

Family                                   splitrule
Regression                               mse
Quantile Regression                      la.quantile.regr, quantile.regr, mse
Classification                           gini, auc, entropy
Imbalanced Two-Class                     gini, auc, entropy
Survival                                 logrank, bs.gradient, logrankscore
Competing Risk                           logrankCR, logrank
Multivariate Regression                  mv.mse, mahalanobis
Multivariate Classification              mv.gini
Multivariate Mixed Regression            mv.mix
Multivariate Quantile Regression         mv.mse
Multivariate Mixed Quantile Regression   mv.mix
Unsupervised                             unsupv
sidClustering                            {mv.mse, mv.gini, mv.mix}, mahalanobis
Breiman (Shi-Horvath)                    gini, auc, entropy

Classification example: Glioma

Split rules: Gini index splitting (default)

o <- rfsrc(y~., data = glioma,
           splitrule="gini") ## default splitrule as in the previous slide
o
                         Sample size: 878
           Frequency of class labels: 148, 174, 249, 25, 41, 215, 26
                     Number of trees: 500
           Forest terminal node size: 1
       Average no. of terminal nodes: 74.302
No. of variables tried at each split: 36
              Total no. of variables: 1241
       Resampling used to grow trees: swor
    Resample size used to grow trees: 555
                            Analysis: RF-C
                              Family: class
                      Splitting rule: gini *random*
       Number of random split points: 10
                   (OOB) Brier score: 0.02333115
        (OOB) Normalized Brier score: 0.19053769
                           (OOB) AUC: 0.99304797
                      (OOB) Log-loss: 0.34115074
   (OOB) Requested performance error: 0.07061503, 0.06756757, 0.01149425, 0.00803213, 0.52, 0.53658537, 0.04651163, 0.11538462

See Chapter 4.3 in Breiman et al. 1984 [4]

Classification example: Glioma

Split rules: AUC (area under the ROC curve)

o <- rfsrc(y~., data = glioma,
           splitrule="auc")
o
                         Sample size: 878
           Frequency of class labels: 148, 174, 249, 25, 41, 215, 26
                     Number of trees: 500
           Forest terminal node size: 1
       Average no. of terminal nodes: 434.522
No. of variables tried at each split: 36
              Total no. of variables: 1241
       Resampling used to grow trees: swor
    Resample size used to grow trees: 555
                            Analysis: RF-C
                              Family: class
                      Splitting rule: auc *random*
       Number of random split points: 10
                   (OOB) Brier score: 0.04799401
        (OOB) Normalized Brier score: 0.39195105
                           (OOB) AUC: 0.97510083
                      (OOB) Log-loss: 0.62743109
   (OOB) Requested performance error: 0.16514806, 0.25675676, 0.09770115, 0.01606426, 0.84, 0.97560976, 0.04651163, 0.57692308

Tip

AUC splitting is appropriate for imbalanced data

Classification example: Glioma

Split rules: entropy splitting

o <- rfsrc(y~., data = glioma,
           splitrule="entropy")
o
                         Sample size: 878
           Frequency of class labels: 148, 174, 249, 25, 41, 215, 26
                     Number of trees: 500
           Forest terminal node size: 1
       Average no. of terminal nodes: 287.266
No. of variables tried at each split: 36
              Total no. of variables: 1241
       Resampling used to grow trees: swor
    Resample size used to grow trees: 555
                            Analysis: RF-C
                              Family: class
                      Splitting rule: entropy *random*
       Number of random split points: 10
                   (OOB) Brier score: 0.04504941
        (OOB) Normalized Brier score: 0.36790354
                           (OOB) AUC: 0.96636193
                      (OOB) Log-loss: 0.64571496
   (OOB) Requested performance error: 0.15148064, 0.16216216, 0.04022989, 0.01204819, 0.72, 1, 0.10232558, 0.69230769

See Chapters 2.5 and 4.3 in Breiman et al. 1984 [4]

The run.rfsrc function for an overview

run.rfsrc(y~., data = glioma)

Survival

Family                                   Example Grow Call with Formula Specification
Regression                               rfsrc(Ozone~., data = airquality)
Quantile Regression                      quantreg(mpg~., data = mtcars)
Classification                           rfsrc(Species~., data = iris)
Imbalanced Two-Class                     imbalanced(status~., data = breast)
Survival                                 rfsrc(Surv(time, status)~., data = veteran)
Competing Risk                           rfsrc(Surv(time, status)~., data = wihs)
Multivariate Regression                  rfsrc(Multivar(mpg, cyl)~., data = mtcars)
Multivariate Mixed Regression            rfsrc(cbind(Species, Sepal.Length)~., data = iris)
Multivariate Quantile Regression         quantreg(cbind(mpg, cyl)~., data = mtcars)
Multivariate Mixed Quantile Regression   quantreg(cbind(Species, Sepal.Length)~., data = iris)
Unsupervised                             rfsrc(data = mtcars)
sidClustering                            sidClustering(data = mtcars)
Breiman (Shi-Horvath)                    sidClustering(data = mtcars, method = "sh")

Survival

In survival one of the primary goals is estimating the survival function \[ S(t|\mathbf{X}=\mathbf{x}) = P\{T^o> t|\mathbf{X}=\mathbf{x}\} \] where \(T^o\) is the survival time

Random survival forests [9] use the Kaplan-Meier (KM) estimator with the general estimation strategy (product-limit estimator) \[ S(t) = S(t_1) \times S(t_2|t_1) \times \cdots \times S(t_M|t_{M-1}) \] with \[ S(t_j|t_{j-1}) = \frac{\text{number surviving beyond $t_j$}}{\text{number surviving beyond $t_{j-1}$}}. \]
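A toy base-R computation of the product-limit estimator (independent of the package) shows the recursion at work:

```r
## toy sample: follow-up time and status (1 = death, 0 = censored)
time   <- c(2, 3, 3, 5, 8)
status <- c(1, 1, 0, 1, 0)

tj <- sort(unique(time[status == 1]))       ## distinct death times
km <- cumprod(sapply(tj, function(t) {
  n.risk <- sum(time >= t)                  ## at risk just before t
  d      <- sum(time == t & status == 1)    ## deaths at t
  (n.risk - d) / n.risk                     ## S(t_j | t_{j-1})
}))
km   ## should match survival::survfit(Surv(time, status) ~ 1)$surv at tj
```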

Survival

This deals with generalized Type-I censoring, where patients enter the study at different times, the length of follow-up varies from patient to patient, and the event time is unobserved for some patients because the study ends

Survival

The tree node estimator is the KM estimator. Tree nodes are split using the log-rank statistic
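To see what the splitting rule measures, the log-rank statistic for a candidate split can be computed with the survival package (the split on age is purely hypothetical, for illustration):

```r
library(survival)
data(veteran, package = "survival")

## hypothetical candidate split: age <= 60 vs age > 60
grp <- veteran$age <= 60
survdiff(Surv(time, status) ~ grp, data = veteran)$chisq
```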

Key quantities

The key quantities returned by the package are

o$time.interest --->   event times (everything keys off this)

o$predicted     --->   inbag estimated mortality
o$predicted.oob --->   OOB estimated mortality

o$survival      --->   inbag survival estimator for each case
o$survival.oob  --->   OOB survival estimator for each case

o$chf          --->    inbag CHF estimator for each case
o$chf.oob      --->    OOB CHF estimator for each case

Survival example: PBC Mayo Clinic

This data is from the Mayo Clinic trial in Primary Biliary Cirrhosis (PBC) conducted between 1974 and 1984. Of the total 424 PBC patients, the first 312 cases participated in a randomized placebo-controlled trial of the drug D-penicillamine with largely complete data. The additional 112 cases did not participate in the clinical trial, but consented to have basic measurements recorded and to be followed for survival; six of these were lost to follow-up shortly after diagnosis [10]

id time status trt      age sex ascites hepato spiders edema bili chol albumin copper alk.phos    ast trig platelet protime stage
 1  400      2   1 58.76523   f       1      1       1   1.0 14.5  261    2.60    156   1718.0 137.95  172      190    12.2     4
 2 4500      0   1 56.44627   f       0      1       1   0.0  1.1  302    4.14     54   7394.8 113.52   88      221    10.6     3
 3 1012      2   1 70.07255   m       0      0       0   0.5  1.4  176    3.48    210    516.0  96.10   55      151    12.0     4
 4 1925      2   1 54.74059   f       0      1       1   0.5  1.8  244    2.54     64   6121.8  60.63   92      183    10.3     4
 5 1504      1   2 38.10541   f       0      1       1   0.0  3.4  279    3.53    143    671.0 113.15   72      136    10.9     3
 6 2503      2   2 66.25873   f       0      1       0   0.0  0.8  248    3.98     50    944.0  93.00   63       NA    11.0     3
data(pbc, package = "survival")  
dim(pbc)
> [1] 418  20

Survival example: PBC Mayo Clinic

pbc$id <- NULL ## remove the ID
## keep the original competing risk framework for later
## status at endpoint, 0/1/2 for censored, transplant, dead
pbc.cr <- pbc

## convert to right-censoring with death as the event
pbc$status[pbc$status > 0] <- 1
o <- rfsrc(Surv(time, status) ~ ., data = pbc)
o
                         Sample size: 276
                    Number of deaths: 129
                     Number of trees: 500
           Forest terminal node size: 15
       Average no. of terminal nodes: 14.326
No. of variables tried at each split: 5
              Total no. of variables: 17
       Resampling used to grow trees: swor
    Resample size used to grow trees: 174
                            Analysis: RSF
                              Family: surv
                      Splitting rule: logrank *random*
       Number of random split points: 10
                          (OOB) CRPS: 553.15107271
                   (OOB) stand. CRPS: 0.13198546
   (OOB) Requested performance error: 0.18814768

Key quantities

o$time.interest --->   event times (everything keys off this)
o$time.interest[1:5]
> [1]  41  51  71  77 110


o$predicted     --->   inbag estimated mortality
o$predicted.oob --->   OOB estimated mortality
o$predicted[1:5]
> [1] 200.83450  17.07501  76.31131  57.77577  79.27206

o$predicted.oob[1:5]
> [1] 190.04590  28.05044  66.25447  55.03181  72.55308


o$survival      --->   inbag survival estimator for each case
o$survival.oob  --->   OOB survival estimator for each case
dim(o$yvar) ## sample size (complete cases) n = 276
> [1] 276   2
length(o$time.interest)
> [1] 127
dim(o$survival)      # --->   inbag survival estimator for each case
> [1] 276 127
dim(o$survival.oob)  # --->   OOB survival estimator for each case
> [1] 276 127

Key quantities

o$survival      --->   inbag survival estimator for each case
o$survival.oob  --->   OOB survival estimator for each case
dim(o$survival)      # --->   inbag survival estimator for each case
> [1] 276 127
dim(o$survival.oob)  # --->   OOB survival estimator for each case
> [1] 276 127

o$survival[1:5, 1:5]
>           [,1]      [,2]      [,3]      [,4]      [,5]
> [1,] 0.9600951 0.9125329 0.9002843 0.8725504 0.8423145
> [2,] 1.0000000 0.9999167 0.9997543 0.9996710 0.9995136
> [3,] 0.9959351 0.9921702 0.9877700 0.9782302 0.9768478
> [4,] 0.9954334 0.9906795 0.9871662 0.9857532 0.9841598
> [5,] 1.0000000 0.9998401 0.9967281 0.9947621 0.9946241
o$survival.oob[1:5, 1:5]
>           [,1]      [,2]      [,3]      [,4]      [,5]
> [1,] 0.9655403 0.9160407 0.9050698 0.8755058 0.8428749
> [2,] 1.0000000 1.0000000 0.9995281 0.9995281 0.9995281
> [3,] 0.9957842 0.9926934 0.9881592 0.9768084 0.9752430
> [4,] 0.9942668 0.9873056 0.9850606 0.9838238 0.9819504
> [5,] 1.0000000 1.0000000 0.9962979 0.9942830 0.9939347
## plot survival curves for first 10 individuals 
matplot(o$time.interest, 
        100 * t(o$survival.oob[1:10, ]),
        xlab = "Time", ylab = "Survival", type = "l", lty = 1)

Key quantities

o$chf          --->    inbag CHF estimator for each case
o$chf.oob      --->    OOB CHF estimator for each case
dim(o$chf) 
> [1] 276 127
dim(o$chf.oob) 
> [1] 276 127
o$chf[1:5, 1:5]
>             [,1]         [,2]        [,3]         [,4]         [,5]
> [1,] 0.039904941 8.981063e-02 0.103197429 0.1338071668 0.1686493781
> [2,] 0.000000000 8.333333e-05 0.000245671 0.0003326275 0.0004976107
> [3,] 0.004064919 7.977572e-03 0.012491370 0.0227012085 0.0244150195
> [4,] 0.004566626 9.643876e-03 0.013209059 0.0146954141 0.0163487486
> [5,] 0.000000000 1.598746e-04 0.003271896 0.0052661647 0.0054040957
o$chf.oob[1:5, 1:5]
>             [,1]        [,2]         [,3]         [,4]         [,5]
> [1,] 0.034459694 0.085894243 0.0978252512 0.1301985027 0.1682733452
> [2,] 0.000000000 0.000000000 0.0004719118 0.0004719118 0.0004719118
> [3,] 0.004215783 0.007519603 0.0122166452 0.0244321624 0.0264688132
> [4,] 0.005733225 0.013259881 0.0155376855 0.0168153866 0.0187680623
> [5,] 0.000000000 0.000000000 0.0037021133 0.0057619372 0.0061102479
## plot survival curves for first 10 individuals 
matplot(o$time.interest, 
        100 * t(o$chf.oob[1:10, ]),
        xlab = "Time", ylab = "Cumulative hazard", type = "l", lty = 1)

Key quantities

Connection between mortality and the CHF

Mortality for case \(i\) equals the expected number of deaths if all patients were the same as case \(i\). It is obtained by summing the CHF over the event times

plot(o$predicted.oob, rowSums(o$chf.oob), xlab="OOB Mortality", ylab="Sum of OOB CHF")
abline(0,1)
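Since mortality is the sum of the CHF over the event times, the two quantities should agree numerically:

```r
## mortality should equal the row sums of the CHF over o$time.interest
max(abs(o$predicted.oob - rowSums(o$chf.oob)))   ## essentially zero
```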

Survival example: PBC Mayo Clinic

Split rules

Family                                   splitrule
Regression                               mse
Quantile Regression                      la.quantile.regr, quantile.regr, mse
Classification                           gini, auc, entropy
Imbalanced Two-Class                     gini, auc, entropy
Survival                                 logrank, bs.gradient, logrankscore
Competing Risk                           logrankCR, logrank
Multivariate Regression                  mv.mse, mahalanobis
Multivariate Classification              mv.gini
Multivariate Mixed Regression            mv.mix
Multivariate Quantile Regression         mv.mse
Multivariate Mixed Quantile Regression   mv.mix
Unsupervised                             unsupv
sidClustering                            {mv.mse, mv.gini, mv.mix}, mahalanobis
Breiman (Shi-Horvath)                    gini, auc, entropy

Split rules

log-rank splitting [11] (default)

o <- rfsrc(Surv(time, status) ~ ., data = pbc, 
                  splitrule="logrank") ## default splitrule
o
                         Sample size: 276
                    Number of deaths: 129
                     Number of trees: 500
           Forest terminal node size: 15
       Average no. of terminal nodes: 14.326
No. of variables tried at each split: 5
              Total no. of variables: 17
       Resampling used to grow trees: swor
    Resample size used to grow trees: 174
                            Analysis: RSF
                              Family: surv
                      Splitting rule: logrank *random*
       Number of random split points: 10
                          (OOB) CRPS: 553.15107271
                   (OOB) stand. CRPS: 0.13198546
   (OOB) Requested performance error: 0.18814768

Split rules

Gradient-based Brier score splitting

o <- rfsrc(Surv(time, status) ~ ., data = pbc, 
                  splitrule="bs.gradient")
o
                         Sample size: 276
                    Number of deaths: 129
                     Number of trees: 500
           Forest terminal node size: 15
       Average no. of terminal nodes: 13.226
No. of variables tried at each split: 5
              Total no. of variables: 17
       Resampling used to grow trees: swor
    Resample size used to grow trees: 174
                            Analysis: RSF
                              Family: surv
                      Splitting rule: bs.gradient *random*
       Number of random split points: 10
                          (OOB) CRPS: 570.11311519
                   (OOB) stand. CRPS: 0.13603272
   (OOB) Requested performance error: 0.19299269

Tip

The time horizon used for the Brier score is set to the 90th percentile of the observed event times. This can be overridden by the option prob, which must be a value between 0 and 1 (set to .90 by default)
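For instance (a hedged sketch, assuming prob is accepted by rfsrc as described in the tip above), to use the 80th percentile of the observed event times as the horizon:

```r
o <- rfsrc(Surv(time, status) ~ ., data = pbc,
           splitrule = "bs.gradient", prob = 0.80)
```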

Split rules

log-rank score splitting [12]

o <- rfsrc(Surv(time, status) ~ ., data = pbc, 
                  splitrule="logrankscore")
o
                         Sample size: 276
                    Number of deaths: 129
                     Number of trees: 500
           Forest terminal node size: 15
       Average no. of terminal nodes: 13.46
No. of variables tried at each split: 5
              Total no. of variables: 17
       Resampling used to grow trees: swor
    Resample size used to grow trees: 174
                            Analysis: RSF
                              Family: surv
                      Splitting rule: logrankscore *random*
       Number of random split points: 10
                          (OOB) CRPS: 626.32256755
                   (OOB) stand. CRPS: 0.14944466
   (OOB) Requested performance error: 0.19231977
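The three survival split rules shown above can be compared side by side by looping over the splitrule argument and extracting the OOB error from the fitted object (a sketch; assumes the same complete-case pbc data, and uses the err.rate component returned by rfsrc, whose last entry is the requested OOB performance error):

```r
library(randomForestSRC)
data(pbc, package = "randomForestSRC")
pbc <- na.omit(pbc)  ## complete cases, matching the sample size shown above

## compare OOB prediction error across the survival split rules
rules <- c("logrank", "bs.gradient", "logrankscore")
err <- sapply(rules, function(r) {
  o <- rfsrc(Surv(time, status) ~ ., data = pbc, splitrule = r)
  tail(o$err.rate, 1)  ## last entry = OOB requested performance error
})
print(err)
```

Because trees are grown with random split points, results will vary slightly from run to run; differences between rules of the size seen above are typical.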

The run.rfsrc function for an overview

run.rfsrc(Surv(time, status) ~ ., data = pbc)


References

1. Breiman L. Random forests. Machine Learning. 2001;45:5–32.
2. Ishwaran H, Lu M, Kogalur UB. randomForestSRC: Getting started with randomForestSRC vignette. 2021. http://randomforestsrc.org/articles/getstarted.html.
3. Ishwaran H, Lu M, Kogalur UB. randomForestSRC: Forest weights, in-bag (IB) and out-of-bag (OOB) ensembles vignette. 2021. http://randomforestsrc.org/articles/forestWgt.html.
4. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees. CRC press; 1984.
5. De Cock D. Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education. 2011;19.
6. Ceccarelli M, Barthel FP, Malta TM, Sabedot TS, Salama SR, Murray BA, et al. Molecular profiling reveals biologically discrete subsets and pathways of progression in diffuse glioma. Cell. 2016;164:550–63.
7. Ishwaran H, Lu M, Kogalur UB. randomForestSRC: AUC splitting for multiclass problems vignette. 2022. http://randomforestsrc.org/articles/aucsplit.html.
8. Ishwaran H, Lauer MS, Blackstone EH, Lu M, Kogalur UB. randomForestSRC: Random survival forests vignette. 2021. http://randomforestsrc.org/articles/survival.html.
9. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. The Annals of Applied Statistics. 2008;2:841–60.
10. Therneau T, Grambsch P. Modeling survival data: Extending the Cox model. Springer; 2000.
11. Segal MR. Regression trees for censored data. Biometrics. 1988;35–47.
12. Hothorn T, Lausen B. On the exact distribution of maximally selected rank statistics. Computational Statistics & Data Analysis. 2003;43:121–37.